History of Two Dimensional Arrays of Processors(1)


The SOLOMON (Simultaneous Operation Linked Ordinal MOdular 
Network) computer(2) was a proposal for a two-dimensional SIMD 
array of 32 by 32 processing elements, each with a bit-serial 
arithmetic unit. Each PE had a local memory of 4096 bits. This 
computer was never built in the form described in Slotnick's 1962 
paper, but it gave birth to Illiac IV, the ICL DAP and the 
Goodyear Aerospace MPP. 


The University of Illinois began the design of a SOLOMON-type 
computer in 1966 which became ILLIAC IV(3). One quadrant of the 
machine was built by Burroughs and delivered to NASA in 1972. It 
consisted of an 8 by 8 array of 64-bit, floating point 
processors, each with 2048 words of 64-bit memory. Based on the 
lessons from Illiac IV, Burroughs developed a commercial design 
(the BSP), consisting of 16 processors and a more elaborate 
memory hierarchy(4). 


During the mid to late 1970's, International Computers Limited 
(ICL) developed a Distributed Array Processor (DAP)(5). The DAP 
consisted of a 64 by 64 array of one-bit processors. Sixteen 
processors, with 4096 bits of memory each, were contained on each 
of 256 boards. 


In 1983, Goodyear Aerospace delivered a Massively Parallel 
Processor (MPP) to NASA(6). This machine consisted of a 128 by 
128 array of processor elements, which was constructed from 2048 
CMOS integrated circuits, each of which contained eight single-bit 
processors. Separate memory devices provide 1024 bits of 
memory for each processor. 


Paralleling the development of MPP-type architectures has been 
the development of systolic architectures(7). In these SIMD 
systems, data is continuously pumped through an array of 
processors. 


Massively parallel architectures are well suited to VLSI 
technology(8). The component density on integrated circuits has 
been doubling every two years. It is now more cost effective to 
increase computational power by massive parallelism than through 
the use of faster transistors. 


Also, with higher levels of integration, the interconnect wires 
are becoming more expensive than the transistors. Thus, 
architectures utilizing two dimensional arrays of processors, 
with only local communication to nearest neighbor processors are 
easily designed and manufactured in VLSI. 


From 1982 to 1983, Martin Marietta Aerospace, Orlando, Florida, 
developed a VLSI architecture for the Geometric Arithmetic 
Parallel Processor (GAPP). This VLSI component, containing 72 
processor elements with 128 bits of memory for each PE, has been 
designed and manufactured by NCR Microelectronics, Fort Collins, 
Colorado. The commercial product introduction of the GAPP by NCR 
Microelectronics in 1984 will have a major impact on the 
architecture of special purpose computers. For the first time, 
massively parallel architectures can be implemented with low 
cost silicon chips. 


In the future, Wafer Scale Integration (WSI) will push computer 
architectures even further in the direction of parallel 
multiprocessor arrays(9). Research on advanced multiprocessor 
architectures is underway in many universities(10). 

NCRMAC


* SUBROUTINE: Sum Products
* Author: Todd Davies
*
* Implements the formula: A = (X1)*(Y1)+(X2)*(Y2)+...+(Xn)*(Yn)
* A0 points to the first element in the X list.
* A1 points to the first element in the Y list.
* D0 contains the number of products to be summed.
* Result of product sum is returned in D2 and low byte of D3.

NCRMAC   EQU     XXXX                     Base address for MAC chip.

* Write offsets

WXYCLRA  EQU     $3                       Write to both X and Y, clear Acc.
ADDXYWX  EQU     $6                       Add X*Y to Acc., put new data in X.
WRITE_Y  EQU     $9                       Write new data to Y.

* Read offsets

A_LOW    EQU     $0                       Low word of Acc.
A_MID    EQU     $1                       Bits 16-31 of Acc.
A_HIGH   EQU     $2                       Bits 32-47 (40-47 extended).

START:   MOVE.W  #NCRMAC,A2
         CLR.W   WXYCLRA*2(A2)            Clear X, Y, and Acc.
LOOP:    MOVE.W  (A0)+,ADDXYWX*2(A2)      Mult./Acc., write next X
         MOVE.W  (A1)+,WRITE_Y*2(A2)      Write next Y.
         DBF     D0,LOOP
         MOVE.W  D0,ADDXYWX*2(A2)         Last mult./acc.
         MOVE.W  A_LOW*2(A2),D1           Fetch low word in acc.
         MOVE.W  A_MID*2(A2),D2           Fetch bits 16-31 in acc.
         SWAP    D2                       Swap word halves
         MOVE.W  D1,D2                    Convert to single 32 bit.
         MOVE.W  A_HIGH*2(A2),D3          Fetch high acc. word
         RTS
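
The arithmetic performed by the listing can be checked against a small software model. The sketch below is illustrative only: it assumes signed integer operands and a 40-bit two's complement accumulator (suggested by the "40-47 extended" comment), and the function names are hypothetical.

```python
def sum_products(xs, ys):
    """Model A = X1*Y1 + X2*Y2 + ... + Xn*Yn in an assumed 40-bit accumulator."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y                 # one multiply/accumulate per pair
    acc &= (1 << 40) - 1             # keep 40 bits, two's complement
    d2 = acc & 0xFFFFFFFF            # bits 0-31, as returned in D2
    d3_low = (acc >> 32) & 0xFF      # bits 32-39, as returned in the low byte of D3
    return d2, d3_low


def to_signed40(d2, d3_low):
    """Rebuild the signed 40-bit result from the two returned pieces."""
    value = (d3_low << 32) | d2
    return value - (1 << 40) if value & (1 << 39) else value
```

For example, `to_signed40(*sum_products([1, 2, 3], [4, 5, 6]))` gives 32.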
FORT COLLINS, COLORADO

NCR45CG72
PROCESSOR ELEMENT AND DATA BUS IDENTIFICATION
TOP VIEW OF PACKAGE

[Figure: 12-row by 6-column grid of processor elements numbered 00 through B5, with west (W) and east (E) data bus pins along the sides. Global connections to every processor element: control lines and RAM address lines.]


NCR45CG72
BLOCK DIAGRAM OF CONNECTIONS BETWEEN
FOUR PROCESSOR ELEMENTS

[Figure: four processor elements joined through bidirectional noninverting I/O buffers; an open drain global output is driven by a 72-input OR gate.]

OE = Output Enable is an internal connection.
East Outputs enabled whenever C5 = 1 and C6 = 1 and C7 = 0 (EW:=W)
West Outputs enabled whenever C5 = 0 and C6 = 1 and C7 = 1 (EW:=E)
North Outputs enabled whenever C2 = 1 and C3 = 1 and C4 = 0 (NS:=S)
South Outputs enabled whenever C2 = 0 and C3 = 1 and C4 = 0 (NS:=N)
GO is pulled low whenever any NS register contains 1


NCR45CG72
SCHEMATIC DIAGRAM OF ONE PROCESSOR ELEMENT

[Figure: one processor element with its 128 X 1 bit RAM and seven address lines; remaining detail not recoverable.]
NCR45CG72
INSTRUCTION SET

[The control-line columns of this table (0, 1 and X entries) are not recoverable; mnemonics and descriptions follow.]

Mnemonic      Description

CM:=CM        MICRO-NOP
CM:=RAM       LOAD CM FROM RAM
CM:=CMS       MOVE FROM CMS INTO CM
CM:=0         LOAD 0 INTO CM

NS:=NS        MICRO-NOP
NS:=RAM       LOAD NS FROM RAM
NS:=N         MOVE FROM N INTO NS
NS:=S         MOVE FROM S INTO NS
NS:=EW        MOVE FROM EW INTO NS
NS:=C         MOVE FROM C INTO NS
NS:=0         LOAD 0 INTO NS

EW:=EW        MICRO-NOP
EW:=RAM       LOAD EW FROM RAM
EW:=E         MOVE FROM E INTO EW
EW:=W         MOVE FROM W INTO EW
EW:=NS        MOVE FROM NS INTO EW
EW:=C         MOVE FROM C INTO EW
EW:=0         LOAD 0 INTO EW

C:=C          MICRO-NOP
C:=RAM        LOAD C FROM RAM
C:=NS         MOVE FROM NS INTO C
C:=EW         MOVE FROM EW INTO C
C:=CY         LOAD C FROM CARRY
C:=BW         LOAD C FROM BORROW
C:=0          LOAD 0 INTO C
C:=1          LOAD 1 INTO C

READ          READ FROM RAM
RAM:=CM       LOAD RAM FROM CM
RAM:=C        LOAD RAM FROM C
RAM:=SM       LOAD RAM FROM SUM
TIMING DIAGRAM

[Figure: clock, instruction, data-in and data-out waveforms; input thresholds VIH = 2.0V, VIL = 0.8V.]

NOTE: 1,2,3 refer to the staging sequence of instruction, data in and data out.


GAPP SYSTEM IMPLEMENTATION

[Figure: GAPP array of processors with programmable east-to-west wrap-around and global output, GAPP system controller, corner turn control, video line buffers for data in and data out, and host CPU bus.]


GAPP ARRAY AND BUFFER

[Figure: RAM of processor array, input line, output line, and GAPP system controller.]
ORGANIZATION OF DATA IN THE GAPP ARRAY


• WORD SERIAL, BIT PARALLEL DATA MUST BE
  CONVERTED TO WORD PARALLEL, BIT SERIAL FORMAT


• TAG BIT MAINTAINED BOTH IN A MEMORY LOCATION
  AND IN NS REGISTER


• RESPONDER SIGNAL CREATED FROM THE NS REGISTER

• EDGES OF ARRAY ARE TREATED AS DUMMY CELLS


• LAYOUT OF DATA IN MEMORY IN ONE PE


  RAM ADDRESS = 0 --- 4 ------- 4+M ---- 125 126 127
                      |-M BITS-|      ^
                      CELL   CELL+M-1   TAG
                      (MSB)  (LSB)
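
The conversion from word serial, bit parallel data to word parallel, bit serial form (a "corner turn") can be sketched as a transpose of bits. This is a minimal illustration, not the controller's actual method; the function names are hypothetical and the MSB-first ordering follows the layout above.

```python
def corner_turn(words, m):
    """Return m bit planes; planes[k][p] is bit k (MSB first) of words[p].

    Plane k would be stored at RAM address CELL + k of every PE p.
    """
    return [[(w >> (m - 1 - k)) & 1 for w in words] for k in range(m)]


def corner_turn_back(planes):
    """Rebuild the word list from MSB-first bit planes."""
    m = len(planes)
    count = len(planes[0])
    return [sum(planes[k][p] << (m - 1 - k) for k in range(m))
            for p in range(count)]
```

For example, `corner_turn([5, 3], 3)` yields the planes `[[1, 0], [0, 1], [1, 1]]`.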
lew 8 December 1984 


NOTATION OF ALGORITHMS


• FUNCTION NAME(arg1, arg2)
  BRIEF DESCRIPTION OF THE FUNCTION


• METHOD:
  DESCRIPTION OF THE ALGORITHM


• ALGORITHM:
  C LANGUAGE LIKE SYNTAX USED TO DESCRIBE
  CONTROL STRUCTURES THAT ARE EXECUTED BY THE
  CONTROL UNIT AND GAPP MNEMONICS THAT ARE
  PASSED TO THE ARRAY VIA THE CONTROL UNIT


• TIME:
  ASSUMES THAT THE ARRAY IS ALWAYS RUN WITH NO
  WAIT STATES ADDED BY THE CONTROL UNIT



IMPLEMENTING THE COMPARE


• EFFECTIVE BIT WIDE IMPLEMENTATION THAT CAN BE
  EXPANDED FOR MULTI BIT COMPARE


• COMPARE(addr, value)

  FOR EVERY RESPONDER, COMPARE SETS THE TAG BIT
  IF THE VALUE RESIDING AT THE RAM addr MATCHES
  THE value ARGUMENT. addr IS A NUMBER BETWEEN 0
  AND 127. value IS A BOOLEAN ARGUMENT.


• METHOD:

  LOAD THE NS REGISTER WITH THE VALUE STORED AT
  addr. LOAD THE EW REGISTER WITH value. EXNOR THE
  EW AND NS REGISTERS AND PLACE THE RESULT IN THE
  NS REGISTER. AND THIS WITH THE TAG BIT IN RAM
  AND PLACE THE RESULT IN THE NS REGISTER AND IN
  THE TAG LOCATION OF RAM. THIS BECOMES THE NEW
  TAG.


• ALGORITHM:
  /* LOAD THE NS AND EW REGISTERS */
  IF( value == 0)
      EW=0; NS=RAM(addr); C=1;
  ELSE{
      C=1;
      EW=C; NS=RAM(addr);
  }

  /* EXNOR INTO THE NS REG */
  NS=RAM(TEMP); RAM(TEMP)=SM;
  /* AND RESULT WITH TAG */
  EW=RAM(TAG); C=0;
  C=CY;

  /* PLACE RESULTS IN RAM AND NS */
  RAM(TAG)=C; NS=C;
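
The net effect of COMPARE on the array can be modelled in a few lines. This sketch ignores the register-level steps and only reproduces the result described in the method; the list-of-lists RAM model and the tag address 127 are assumptions made for illustration.

```python
TAG = 127  # assumed RAM address of the tag bit


def compare(ram, addr, value):
    """ram[p] is the 128-bit RAM of PE p, held as a list of 0/1 values.

    Each tag becomes tag AND EXNOR(RAM(addr), value); the new tags are
    returned as the contents of the NS registers.
    """
    ns = []
    for pe in ram:                      # every PE acts at once on the real array
        match = 1 if pe[addr] == value else 0
        pe[TAG] &= match
        ns.append(pe[TAG])
    return ns
```

A PE that was not a responder before the call (tag 0) stays a non-responder whatever its data.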


IMPLEMENTING THE EQUALITY SEARCH


• EXACT_MATCH
  SEARCH THE RESPONDERS OF THE ARRAY FOR AN
  EXACT MATCH TO THE MASKED COMPARAND REGISTER


• METHOD:

  USE THE COMPARE PRIMITIVE TO MATCH EACH BIT OF
  THE MASKED COMPARAND REGISTER WITH THE WORDS
  STORED IN THE ARRAY


• ALGORITHM:
  /* LOOP FOR EVERY BIT IN THE WORD */
  for (i=0; i < m; i++){
      if(mask(i) == 1) then {
          COMPARE(cell+i, Comparand(i));
      }
  }


• TIME:

  M * 5.5 CYCLES, WHERE M IS THE NUMBER OF BITS AND
  THE COMPARISONS ARE EQUALLY DISTRIBUTED
  BETWEEN 0 AND 1.
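
The loop above can be sketched as follows. The RAM model, the tag address and the helper `estimated_cycles` are assumptions for illustration; the cycle estimate simply applies the 5.5 cycles per compared bit quoted above to the bits selected by the mask.

```python
TAG = 127  # assumed RAM address of the tag bit


def exact_match(ram, cell, comparand, mask):
    """Clear the tag of every PE whose masked word differs from the comparand."""
    for i, (bit, use) in enumerate(zip(comparand, mask)):
        if use == 1:                          # masked-out bits are skipped
            for pe in ram:                    # one COMPARE(cell+i, bit) on all PEs
                pe[TAG] &= 1 if pe[cell + i] == bit else 0
    return [pe[TAG] for pe in ram]


def estimated_cycles(mask):
    """Average cycle count, assuming 5.5 cycles per compared bit."""
    return 5.5 * sum(mask)
```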


WRITING INTO THE ARRAY


• IT IS POSSIBLE TO LOAD THE ENTIRE ARRAY VIA THE
  CM BUS BUT THIS IS NOT VERY EFFICIENT WHEN ONLY
  ONE OR A FEW CELLS NEED TO BE WRITTEN TO.


• WRITE(addr, value)
  WRITE THE BOOLEAN value INTO THE addr IN THE RAM
  OF THE RESPONDING ELEMENTS.


• METHOD:

  IF THE TAG IS SET THEN PLACE value IN LOCATION
  addr. IF THE TAG IS NOT SET THEN PLACE THE
  CURRENT CONTENTS OF addr IN LOCATION addr.

  RESTORE THE TAG AT THE END OF THE ALGORITHM TO
  ENSURE THAT MULTIPLE INVOCATIONS WORK
  PROPERLY.



WRITING INTO THE ARRAY (CONTINUED)


• ALGORITHM:
  /* Load contents of addr into ew */
  EW=RAM(addr); C=0;

  /* Produce logical AND of TAG, assumed to be in NS, */
  /* with the contents of addr */
  C=BW;
  RAM(TEMP)=C; C=1;

  /* Load value into EW, tag assumed to be in NS */
  if( value == 0)
      EW=0; C=0;
  else
      EW=C; C=0;

  /* Logically AND value and tag */
  C=CY;

  /* load intermediate values in anticipation of OR */
  NS=C; EW=RAM(TEMP); C=1;

  /* perform OR and restore tag */
  C=CY; NS=RAM(TAG);
  RAM(addr)=C;
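
The method amounts to a per-PE selection between the new value and the old contents, controlled by the tag. The sketch below models only that outcome, not the register sequence; the RAM model and the tag address are assumptions.

```python
TAG = 127  # assumed RAM address of the tag bit


def write(ram, addr, value):
    """Store value at addr in responding PEs; other PEs keep their contents."""
    for pe in ram:
        tag = pe[TAG]
        kept = (1 - tag) & pe[addr]     # old contents where the tag is clear
        new = tag & value               # value where the tag is set
        pe[addr] = kept | new           # tags themselves are left unchanged
```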


READING FROM THE ARRAY


• SHIFT OUT THE ENTIRE ARRAY VIA THE CM BUS
  -EFFICIENT ONLY IF A LARGE PORTION OF THE
   ARRAY IS OF INTEREST


• USE THE COMPARE FUNCTION
  -THE COMPARE(addr, 0) FUNCTION PLACES THE
   DATA AT addr ON THE RESPONDER SIGNAL WHERE
   IT CAN BE SHIFTED INTO THE COMPARAND
   REGISTER

  -THE COMPARE(addr, 1) FUNCTION PLACES
   INVERTED DATA ON THE RESPONDER SIGNAL

  -THIS IS EFFECTIVE WHEN A COMPARE IS
   REQUIRED IN ADDITION TO THE READ



READING FROM THE ARRAY (CONTINUED)


• READ(addr)

  PLACE THE DATA AT addr OF RESPONDING ELEMENTS
  IN THE NS REGISTER SO THAT IT PROPAGATES TO THE
  RESPONDER OUTPUT AND CAN BE SHIFTED INTO THE
  COMPARAND REGISTER.


• METHOD:

  LOGICALLY AND THE RAM ADDRESS 'TAG' WITH THE
  DATA AT addr AND PLACE THE RESULTS IN THE NS
  REGISTER. BY USING THE TAG STORED IN RAM,
  REPETITIVE CALLS TO THIS FUNCTION WILL WORK
  PROPERLY BUT THE TAG IN THE NS REGISTER IS
  GARBAGED.


• ALGORITHM:
  /* Load ns with the TAG */
  NS=RAM(TAG);

  /* Load ew with data */
  EW=RAM(addr); C=0;

  /* AND tag and data */
  C=CY;

  /* place results in ns */
  NS=C;


• TIME:
  4 cycles
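
A small model of READ, together with the wired global output, might look like this. The RAM model and tag address are assumptions; the `go_low` flag reflects the datasheet statement that GO is pulled low whenever any NS register contains 1.

```python
TAG = 127  # assumed RAM address of the tag bit


def read(ram, addr):
    """Return (ns, go_low): NS = TAG AND RAM(addr) per PE, plus the GO state."""
    ns = [pe[TAG] & pe[addr] for pe in ram]
    go_low = any(bit == 1 for bit in ns)   # GO is pulled low if any NS holds 1
    return ns, go_low
```

With exactly one responder selected, `go_low` carries that element's data bit out of the array.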


ADDRESSING SPECIFIC ELEMENTS

• LOAD EACH PE WITH A UNIQUE ADDRESS THAT WILL
  BE STORED IN RAM


• PERFORM AN EXACT_MATCH SEARCH FOR THE
  ADDRESS TO SELECT A SINGLE ELEMENT


• READ THE CONTENTS OF THE RESPONDING ELEMENT

• FOR A 516 X 516 ARRAY OF ELEMENTS:
  -EXACT_MATCH --> 114 CYCLES
   (BASED ON 266,256 ELEMENTS REQUIRING 19
   BITS OF ADDRESS)
  -READ --> 32 CYCLES

  -ASSUMING A 10 MHZ CLOCK THE ENTIRE
   OPERATION TAKES 14.6 µSECS
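
The figures on this slide can be re-derived with a few lines of arithmetic. The cycle counts are taken from the slide; only the variable names are introduced here.

```python
side = 516
elements = side * side                        # number of PEs in the array
address_bits = (elements - 1).bit_length()    # bits needed for a unique address

exact_match_cycles = 114                      # from the slide
read_cycles = 32                              # from the slide
clock_mhz = 10.0

total_cycles = exact_match_cycles + read_cycles
total_microseconds = total_cycles / clock_mhz  # cycles / (cycles per microsecond)
cycles_per_address_bit = exact_match_cycles / address_bits
```

The quoted 114 cycles works out to 6 cycles per address bit, slightly above the 5.5 cycle average given for EXACT_MATCH.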


WHAT IS AN ASSOCIATIVE PROCESSOR?


• INCLUDES ALL OF THE CAPABILITIES OF AN
  ASSOCIATIVE MEMORY


• CAPABLE OF PERFORMING LOGICAL OR ARITHMETIC
  OPERATIONS ON ALL DATA WORDS OF THE MEMORY IN
  PARALLEL


• ASSOCIATIVE PROCESSORS ARE INHERENTLY SINGLE
  INSTRUCTION MULTIPLE DATA (SIMD) MACHINES


• GENERALLY, SEARCHES ARE PERFORMED TO
  IDENTIFY DATA ITEMS OF INTEREST (USING AM
  FEATURES) AND THEN THESE ITEMS ARE OPERATED ON
  USING THE AP FEATURES



CONCLUSION

• THE GAPP MAY BE USED IN ASSOCIATIVE MEMORY
  DESIGNS BY:
  -EMULATING BIT PARALLEL OPERATION

  -PAIRING IT WITH THE APPROPRIATE CONTROL
   STRUCTURE

  -USING THE GLOBAL OUTPUT AS AN OUTPUT PORT


• THE GAPP MAY BE USED IN ASSOCIATIVE PROCESSOR
  DESIGNS BY:

  -UTILIZING THE SIMD NATURE OF THE DEVICE

  -UTILIZING THE POWERFUL INSTRUCTION SET
MICROELECTRONICS
FORT COLLINS, COLORADO


DILATION


Dilation is the expansion of an image in the algorithmically determined
direction(s).


Method: Extend single bit plane image by shifting the bit and 'or-ing'
it with its neighbor.


Example 1: Dilate image in (1,0) direction (north).


NS:=RAM(0) C:=1;            /* Load NS with input image */
NS:=S EW:=NS;               /* Shift image north one pixel */
C:=CY;                      /* 'OR' image with shifted image */
RAM(1):=C;                  /* Store results in RAM */


Example 2: Dilate image in (+-1,0) (0,+-1) directions (north, south, east,
and west), where the pixel neighborhood is defined by:


      D
    C A B
      E

NS:=RAM(0) C:=1;            /* Load (A) into NS */
NS:=S EW:=RAM(0);           /* Shift (E) into NS and (A) into EW */
C:=CY NS:=RAM(0);           /* C=(A)+(E), Load (A) into NS */
EW:=C C:=1 NS:=N;           /* EW=(A)+(E), Shift (D) into NS */
C:=CY;                      /* C=(A)+(D)+(E) */
NS:=C EW:=RAM(0) C:=1;      /* NS=(A)+(D)+(E), Load (A) into EW */
EW:=E;                      /* Shift (B) into EW */
C:=CY;                      /* C=(A)+(B)+(D)+(E) */
NS:=C C:=1 EW:=RAM(0);      /* NS=(A)+(B)+(D)+(E), Load (A) into EW */
EW:=W;                      /* Shift (C) into EW */
C:=CY;                      /* C=(A)+(B)+(C)+(D)+(E) */
RAM(1):=C;                  /* Store result in RAM */
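
The result of Example 2 can be cross-checked on a host computer with a direct implementation. This is a sketch, not GAPP code: the image is a list of rows of 0/1 pixels, and pixels outside the image are assumed to be 0, which is an assumption about how the array edges are driven.

```python
def dilate4(image):
    """OR each pixel with its north, south, east and west neighbours."""
    rows, cols = len(image), len(image[0])

    def px(r, c):
        # Pixels outside the image are treated as 0 (assumed edge handling).
        return image[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    return [[px(r, c) | px(r - 1, c) | px(r + 1, c) | px(r, c - 1) | px(r, c + 1)
             for c in range(cols)]
            for r in range(rows)]
```

A single set pixel in the middle of a 3 x 3 image dilates into a plus-shaped pattern.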
WHAT IS A
CONTENT ADDRESSABLE MEMORY?


STORED DATA ITEMS ARE ACCESSED BY A MATCH BETWEEN
A SEARCH WORD (THE 'KEY') AND THE SPECIFIED PORTION
OF THE CELL MEMORY CONTENTS. THE REMAINING MEMORY
CONTENTS OF THE CELL ARE USED AS THE DATA.


ALL STORED DATA ITEMS ARE SEARCHED IN PARALLEL.


THE MEMORY'S RESPONSE TO A MATCH VARIES ACCORDING
TO THE DESIGN AND PURPOSE OF THE MEMORY.


CONTENT ADDRESSABLE MEMORIES ARE OFTEN CALLED:
 -ASSOCIATIVE MEMORY
 -DATA ADDRESSED MEMORY
 -PARALLEL SEARCH MEMORY


CAM OPERATION


PRIMITIVES: SET RESPONDER, READ RESPONDER, WRITE RESPONDER,
COMPARE RESPONDER, COUNT RESPONDER, FIND FIRST RESPONDER


[Figure: COMPARAND REGISTER and MASK REGISTER feeding the MEMORY ARRAY, under an ARRAY CONTROLLER.]
ORGANIZATION OF DATA IN THE GAPP
ARRAY


WORD SERIAL, BIT PARALLEL DATA MUST BE CONVERTED TO
WORD PARALLEL, BIT SERIAL FORMAT.


TAG BIT IS MAINTAINED IN BOTH A MEMORY LOCATION AND IN
THE NS REGISTER.


RESPONDER SIGNAL IS CREATED BY THE NS REGISTER USING THE
GLOBAL OUTPUT SIGNAL.


MEMORY ALLOCATION WITHIN EACH GAPP PE:


RAM ADDRESS = 0 ...... K ...... 126 127
              |-K BITS-|-- BITS --|
              |  KEY   |   DATA   |  TAG
              (MSB)          (LSB)
OWING SEQUENCE OF
CLEAN PATTERN
FIND CONNECTING POINTS
CLEAN PATTERN
FIND END POINTS


THIN PATTERN


THIS METHOD IS DESCRIBED IN DETAIL IN A PAPER WRITTEN BY EST
RELAXATION
SOLUTION IS DESIRABLE. EXAMPLES OF SUCH PROBLEMS INCLUDE:
* HEAT FLOW
* INCOMPRESSIBLE FLUID FLOW
* ELECTRICAL POTENTIALS


GIVEN THAT THE BOUNDARY CONDITIONS ARE FIXED, THE VALUE AT
EACH POINT IN THE SURFACE OR VOLUME MAY BE ESTIMATED BY
AVERAGING THE VALUES OF ITS NEIGHBORING POINTS.
RELAXATION ON THE GAPP


AN ARRAY OF GAPP DEVICES MAY BE USED TO SOLVE A TWO DIMENSIONAL
RELAXATION UTILIZING THE NEIGHBOR CONNECTIONS. THE NEW
VALUE AT EACH POINT IS SIMPLY THE AVERAGE OF THE VALUES OF ITS
EIGHT NEIGHBORS.

THE PARALLEL ARCHITECTURE OF THE GAPP ALLOWS ALL POINTS TO BE
CALCULATED SIMULTANEOUSLY.

THE ALGORITHM IS FINISHED WHEN THE PREVIOUS VALUE OF EVERY ...
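
The averaging step can be sketched sequentially as below. This is an illustration in floating point, whereas the GAPP would work bit-serially on fixed-point values; the stopping rule (largest change below a tolerance `tol`) and the function names are assumptions, since the slide's own termination condition is cut off.

```python
def relax_step(grid):
    """One pass: every interior point becomes the mean of its eight neighbours."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]            # boundary values stay fixed
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            total = sum(grid[r + dr][c + dc]
                        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                        if (dr, dc) != (0, 0))
            new[r][c] = total / 8.0
    return new


def relax(grid, tol=1e-6, max_iters=10000):
    """Repeat relax_step until no point changes by more than tol (assumed rule)."""
    for n in range(max_iters):
        new = relax_step(grid)
        change = max((abs(a - b) for old_row, new_row in zip(grid, new)
                      for a, b in zip(old_row, new_row)), default=0.0)
        grid = new
        if change < tol:
            return grid, n + 1
    return grid, max_iters
```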
GAPP ARRAY
( 32 GAPP DEVICES )
CAPPIX I (EVALUATION GAPP) - is a plug compatible processor board that 
interfaces directly to the IBM XT or AT BUS. It is designed to be used as a tool 
to learn GAPP programming and runs in conjunction with the NCR "GAPSYS" software 
package. The board contains 144 CPUs and 18432 bits of memory. 


CAPPIX II (INTELLIGENT GAPP) - is a plug compatible general purpose image 
processing board that interfaces directly to the IBM XT or AT BUS. The board is 
programmable and contains its own on board controller and memory. The board has 
144 CPUs, upgradable to 288 CPUs and 36864 bits of memory, plus 4Kx16 Data RAM. 


144 CPUs - 


288 CPUs - 
Software supplied: 


IBM PC Programs: Download Microcode, Download/Upload data, Corner Turn 
Debugger, Full Image Swap 


Microcode Programs: Corner Turn, Arithmetic (+, -, *, /), GAPP Initialization, 
GAPP State Output, Convolution 


GAPSYS SOFTWARE - is a package for writing and debugging programs to be run 
on CAPPIX I hardware. 


LIVING SOFTWARE - is a software package which simulates GAPP on the IBM 
Personal Computer. Users can build their own application library while utilizing all 
the interactive facilities of Forth language. 


MACRO-META ASSEMBLER FOR MS-DOS 


RELOCATABLE/LINKABLE MACRO-META ASSEMBLER 
FOR MS-DOS 


GAPP Chips available to CompuPix Customers for immediate delivery...Call for Price 
Quotes. 


*GAPP is a Registered Trade Mark of NCR 
NCR45CG72
GEOMETRIC ARITHMETIC PARALLEL PROCESSOR


APPLICATIONS

• PATTERN RECOGNITION
    Correlation
    Sobel Transform
    Spoke Filter
    Template Matching
    Automated Inspection
    Machine Vision

• IMAGE PROCESSING
    Image Enhancement
    Edge Detection
    2-Dimensional Convolution
    Compression
    Spatial Filtering
    Differential Imaging

• PARALLEL DATA PROCESSING
    Convolution
    Matrix Operations
    Histogram
    Search and Sort

• ASSOCIATIVE PROCESSOR
    Content Addressable Memory
    Limit Search
    Hamming Distance


GENERAL DESCRIPTION


The NCR45CG72 is a two-dimensional systolic array processor chip. It is a mesh-connected six by twelve arrangement of 
1 bit processor elements. Each processor element can communicate with four neighbors: N, E, S, and W. Each processor 
element is composed of a bit serial ALU, 128 X 1 bit RAM and 4 single bit latches. Three latches hold inputs to the ALU 
and the fourth latch allows I/O through the cell without interrupting the ALU, i.e. I/O operations are overlapped with 
computation. 
The cascadeability of the GAPP allows system designers to implement arrays of processors of arbitrary size in multiples of 
6 X 12 elements. 
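
As a quick sizing aid, the number of devices needed for a given array can be estimated from the 6 X 12 tile size. The helper below is a hypothetical sketch; it assumes each device contributes 12 rows by 6 columns of processor elements and rounds up when the array is not an exact multiple.

```python
def gapp_chips(rows, cols, chip_rows=12, chip_cols=6):
    """Number of devices needed to tile a rows x cols array of PEs."""
    chips_down = -(-rows // chip_rows)      # ceiling division
    chips_across = -(-cols // chip_cols)
    return chips_down * chips_across


def total_elements(chips, per_chip=72):
    """Processor elements provided by that many devices."""
    return chips * per_chip
```

Under these assumptions a 516 x 516 array takes 43 x 86 = 3698 devices, which supply exactly 266,256 processor elements.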


FEATURES

• CMOS systolic array with 72 processors per chip
• 6 X 12 array of bit serial processor elements
• Single instruction multiple data stream architecture - all processor elements operate in parallel
• GAPP devices are fully cascadeable
• System throughput increases linearly with number of processor elements in the system
• Broadcast global input and output
• Separate I/O bus = overlapped I/O and computation
• 128 Bits of static RAM per processor
• VLSI double layer metal CMOS technology
• 500 milliwatts power at 10 MHz

Copyright ©1984 by NCR Corporation, Dayton, Ohio, USA. All rights reserved. Printed in USA. 


NCR45CG72
ABSOLUTE MAXIMUM RATINGS

Supply Voltage, VDD ................ +7V
Voltage on any pin with respect
to ground .......................... -0.3 to VDD + 0.3V
Storage temperature ................ -65°C to 150°C

CAUTION

Stresses above "absolute maximum ratings" may result 
in damage to the device. Functional operation of devices 
at the "absolute maximum ratings" or above the recommended 
operation conditions stipulated elsewhere in this 
specification is not implied. 

1. CMOS Devices are damaged by high energy electrostatic discharge. Devices must be stored in conductive foam or with 
all pins shunted. Precautions should be taken to avoid application of voltages higher than the maximum rating. 

2. Remove power before insertion or removal of this device. 


OPERATING CHARACTERISTICS

[Table: values not recoverable. Parameters listed: Supply Voltage; Supply Current (10 pF loads) for the 45CG72-2 and 45CG72-1; Input Low Voltage; Input High Voltage; output voltages; capacitances CIN and CO; Leakage Current on any Input or I/O Pin.]




NCR has a license from Martin Marietta Aerospace to manufacture and market GAPP devices only for commercial and industrial applications. 
GAPP devices may not be sold by NCR to the military market and may not be incorporated into equipment for the military market without 
authorization from Martin Marietta Aerospace, Orlando, Florida. "Military Market" shall mean the market defined by procurements of 
product made directly or indirectly for the U.S. Department of Defense or any other U.S. Government agency or any foreign governments, 
for use in equipment intended for military application and, technically characterized for such application by construction, extreme 
environment capability, electronic circuit adaptations for specifically designed military equipment, and or being type designated by any legally 
authorized government or joint government-industry body, which can confer such designations. 


NCR reserves the right to make any changes or discontinue altogether without notice with respect to any hardware or software product or the 
technical content herein. 





NCR45CG72
TIMING DIAGRAM

[Figure: clock, instruction, data-in and data-out waveforms; input thresholds VIH = 2.0V, VIL = 0.8V.]

NOTE: 1,2,3 refer to the staging sequence of instruction, data in and data out.


AC CHARACTERISTICS

[Table: minimum and maximum timing values with units for clock low, clock high, outputs enabled, global output low, global output tristate and related parameters; numeric entries not recoverable.]

NOTE: (1) d.c. by design; tested at 5 µsec.








NCR45CG72
PROCESSOR ELEMENT AND DATA BUS IDENTIFICATION

TOP VIEW OF PACKAGE

[Figure: 12-row by 6-column grid of processor elements numbered 00 through B5, with the west and east data buses on the sides, the north data bus and CMN outputs on one end, and the south data bus and CMS inputs on the other. Global connections to every processor element in the array: Control Lines, RAM Address, Global Output GO.]

NOTE: This numbering scheme may be extended in systems which contain more than one GAPP device.


PIN LABELS

W00 - WB0         WEST DATA BUS
E05 - EB5         EAST DATA BUS
N00 - N05         NORTH DATA BUS
SB0 - SB5         SOUTH DATA BUS
CMSB0 - CMSB5     INPUT BUS
CMN00 - CMN05     OUTPUT BUS
RA0 - RA6         RAM ADDRESS BUS
C0 - CC           CONTROL LINES
                  - INSTRUCTION BUS
                  - GLOBAL DATA INPUT BUS
GO                GLOBAL OUTPUT LINE


NCR45CG72 
« BLOCK DIAGRAM OF CONNECTIONS BETWEEN 
FOUR PROCESSOR ELEMENTS 
[Noo | Jems} | Nox 
A Bidirectional 
AY cee hx 
«| | cs eet 
== eT eg ce iceee OEE 
EW GLOBAL 






+f ae 
CMS N/S 


1) 72-input 
N/S 


i | or gate 
Fe a <CEiiroee PeRe s ealel  .1 


i 
“ a 
CMS N/S lk 


OE = Output Enable is an internal connection. 

East Outputs enabled whenever EW:=W 

West Outputs enabled whenever EW:=E 

North Outputs enabled whenever NS:=S 

South Outputs enabled whenever NS:=N 

GO is pulled low whenever any NS register contains 1
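
The output-enable rules above can be sketched as a small lookup. This is only an illustration: the function name and the (destination, source) tuple encoding are assumptions made for the example, not part of the data sheet. The idea it shows is that loading a register from one neighbour means data is travelling toward the opposite edge, so that edge's pins drive.

```python
# Which edge output drivers a micro-instruction enables, per the legend above.
# The (destination, source) tuple encoding is an assumption for this sketch.
OUTPUT_ENABLES = {
    ("ew", "w"): "east",   # EW:=W  -> East outputs enabled
    ("ew", "e"): "west",   # EW:=E  -> West outputs enabled
    ("ns", "s"): "north",  # NS:=S  -> North outputs enabled
    ("ns", "n"): "south",  # NS:=N  -> South outputs enabled
}


def enabled_outputs(micro_instructions):
    """Return the set of array edges whose outputs are enabled.

    micro_instructions: iterable of (destination, source) pairs,
    for example [("ew", "w"), ("ns", "ram")].
    """
    return {OUTPUT_ENABLES[op] for op in micro_instructions if op in OUTPUT_ENABLES}


if __name__ == "__main__":
    print(enabled_outputs([("ew", "w"), ("ns", "ram")]))
```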
* SCHEMATIC DIAGRAM OF ONE PROCESSOR ELEMENT

[Schematic: the CONTROL LINES drive the MULTIPLEXORS that select the input of each of the REGISTERS; the ADDRESS LINES select the RAM location.]

NS = NS Register
EW = EW Register
C = Carry Register
CM = Communications Register
CMS = Communications South Input
CMN = Communication North Output
SM = Sum
BW = Borrow
CY = Carry
GO = Global Output





® INSTRUCTION SET 


READ
RAM:=CM
RAM:=C
RAM:=SM


s ARITHMETIC OPERATIONS 


Adder/Subtracter Operations 





INPUT OUTPUT

[Truth table: the entries are not recoverable from the source.]
Control Lines Description

[The control-line bit patterns are not recoverable from the source.]

MICRO-NOP
LOAD CM FROM RAM
MOVE FROM CMS INTO CM
LOAD 0 INTO CM

MICRO-NOP
LOAD NS FROM RAM
MOVE FROM N INTO NS
MOVE FROM S INTO NS
MOVE FROM EW INTO NS
MOVE FROM C INTO NS
LOAD 0 INTO NS

MICRO-NOP
LOAD EW FROM RAM
MOVE FROM E INTO EW
MOVE FROM W INTO EW
MOVE FROM NS INTO EW
MOVE FROM C INTO EW
LOAD 0 INTO EW

MICRO-NOP
LOAD C FROM RAM
MOVE FROM NS INTO C
MOVE FROM EW INTO C
LOAD C FROM CARRY
LOAD C FROM BORROW
LOAD 0 INTO C
LOAD 1 INTO C

READ FROM RAM
LOAD RAM FROM CM
LOAD RAM FROM C
LOAD RAM FROM SUM

# LOGIC OPERATIONS 


LOGICAL
OPERATION DESCRIPTION CONDITIONS

EW = 0, C = 1
NS = 0, C = 1
NS = 0, EW = 1

CY = NS·EW
CY = EW·C
CY = NS·C
BW = NS·EW

CY = NS + EW
BW = NS + EW
BW = EW + C

SM = NS ⊕ C
SM = NS ⊕ EW
SM = EW ⊕ C

XOR
XNOR
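
How logic operations fall out of the adder/subtracter can be illustrated with a short sketch. It assumes the processor element behaves as a conventional one-bit full adder and full subtracter, with the borrow taken for NS minus EW minus C; that assumption is consistent with the "BW = EW + C" entry above but is not confirmed by the legible parts of the table. The function name `alu` is hypothetical.

```python
def alu(ns, ew, c):
    """One-bit full adder/subtracter (assumed behaviour, not from the data sheet).

    Returns (sm, cy, bw): sum, carry, and the borrow of NS - EW - C.
    """
    sm = ns ^ ew ^ c
    cy = (ns & ew) | (ns & c) | (ew & c)
    not_ns = ns ^ 1
    bw = (not_ns & ew) | (not_ns & c) | (ew & c)
    return sm, cy, bw


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            and_ab = alu(a, b, 0)[1]   # CY with C = 0 gives NS AND EW
            or_ab = alu(a, b, 1)[1]    # CY with C = 1 gives NS OR EW
            xor_ab = alu(a, b, 0)[0]   # SM with C = 0 gives NS XOR EW
            xnor_ab = alu(a, b, 1)[0]  # SM with C = 1 gives NS XNOR EW
            print(a, b, and_ab, or_ab, xor_ab, xnor_ab)
```

Holding one input at a constant turns the three-input arithmetic outputs into two-input logic functions, which is why the table lists each operation together with a condition on the remaining register.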
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« TABLE OF PIN NUMBERS VS. SIGNAL LABELS 
(CERAMIC PACKAGE) 





N.C. = No Connection 
# CERAMIC PIN GRID ARRAY PACKAGE

[Package drawing. BOTTOM VIEW: 1.060 ± .010 SQ, with pin A1 indicator. SIDE VIEW: 0.050 ± 0.010, 0.080 ± 0.008, 0.150 ± 0.010.]





© TABLE OF PIN NUMBERS VS. SIGNAL LABELS
(PLASTIC CHIP CARRIER)

[Table entries are not recoverable from the source.]

NC = No Connection








® PLASTIC CHIP CARRIER PACKAGE

[Package drawing. TOP VIEW: 1.190 SQ. and 1.153 SQ. outlines, 0.045 X 45° CHAMFER, 0.010 X 45° CHAMFER, 0.576 and 0.450 dimensions, PIN 1 INDICATOR. SIDE VIEW: 0.107 and 0.150 dimensions.]

All dimensions are in inches
« TABLE OF SIGNAL NAMES VS. PIN NUMBERS
(CERAMIC AND PLASTIC)


SIGNAL PLA SIGNAL PLA SIGNAL 
NAMES PIN NAMES PIN NAMES PIN 
14 Nog K7 | 60 Cy J9 52 


INICIR, 


B3 
C4 
A4 
Ag 
AB 
B8 
-K1 
J4 
J5 
H6 
J? 
K8 
G8 


H9 
G9 
F9 





DEVICES TESTED WITH THIS 
OUTPUT LOAD CONFIGURATION 


DUT 
All outputs 
except GO 


RL = 2.3KΩ


DUT
Global
Output (GO) CL = 40pF


Open drain output on GO allows up to
4 devices to be connected together.





NCR Microelectronics Division 2001 Danfield Ct. Fort Collins, Colorado 80525
Telex: 045-4505 NCRMICRO FTCN Phone: 303/226-9500 303/223-5100
GAPP™ CORNER-TURN BUFFER
(Word Serial/Bit Parallel to Word Parallel/Bit Serial Shift Register)


® APPLICATIONS 


® Two Dimensional Array Data Formatter
® Conversion of data from word serial/bit parallel format to word parallel/bit serial or vice versa
® Buffer for input data to a GAPP™ array


= GENERAL DESCRIPTION 


The NCR45CT6 is a two-dimensional array of shift registers. It is a 6 x 12 arrangement of register groups with 5 registers
per group. Data can be independently shifted in east-west and north-south directions through latches called EW and NS or
stored in local registers called C and R. An additional register path in the N-S direction called CM is unidirectional, shifting
south to north only. The NS and EW paths can shift data bidirectionally.


The NCR45CT6 is a shift register device that allows data to be input into a GAPP array in bit-serial format from data
sources whose outputs are in word format (such as an A/D converter). The 45CT6 devices can be configured to buffer a string
of data words from an A/D converter, for example. Once a full line of data is stored, it is shifted into the GAPP array in
bit-serial format. This is achieved by shifting the LSB of each word from the 45CT6 line buffer into the GAPP array in
parallel, storing these bits in RAM within the GAPP array, and then shifting successively more significant bits from the
45CT6 line buffer into the GAPP array. In real-time video applications this shifting of data into the GAPP array can take
place during the horizontal retrace interval (refer to GAPP Application Note #1 for a suggested system implementation).
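
The corner-turn operation described above can be sketched in software. This is a minimal model of the data reordering only, not of the 45CT6 hardware; the function names `corner_turn` and `reassemble` are hypothetical.

```python
def corner_turn(words, nbits):
    """Convert word-serial/bit-parallel data to bit planes, LSB plane first.

    Plane k holds bit k of every word, which is the order in which the
    line buffer would hand bits to the array.
    """
    return [[(word >> k) & 1 for word in words] for k in range(nbits)]


def reassemble(planes):
    """Inverse operation: rebuild the words from LSB-first bit planes."""
    if not planes:
        return []
    return [
        sum(plane[i] << k for k, plane in enumerate(planes))
        for i in range(len(planes[0]))
    ]


if __name__ == "__main__":
    line = [5, 3, 0, 7]             # one line of 3-bit samples
    planes = corner_turn(line, 3)
    print(planes)
    print(reassemble(planes))
```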




= FEATURES


CMOS register array with 360 registers
6 x 12 array of single bit register groups, each containing 5 registers
Single instruction multiple data stream architecture; all register groups operate in parallel
Devices are cascadeable in two dimensions
Register clear and set capability
Compatible with NCR45CG72 Geometric Arithmetic Parallel Processor chip


GAPP™ is a trademark of NCR Corporation 
Copyright © 1985 by NCR Corporation, Dayton, Ohio, USA. All rights reserved. Printed in USA. June 1985 
NCR45CT6
e ABSOLUTE MAXIMUM RATINGS


Supply Voltage, VDD .................. +7V
Voltage on any pin with respect
to ground ............................ -0.3 to VDD + 0.3V
Storage temperature .................. -65°C to 150°C
CAUTION 


Stresses above "absolute maximum ratings" may result
in damage to the device. Functional operation of devices
at the "absolute maximum ratings" or above the recommended
operation conditions stipulated elsewhere in this
specification is not implied.


1. CMOS devices are damaged by high energy electrostatic discharge. Devices must be stored in conductive foam or with
all pins shunted. Precautions should be taken to avoid application of voltages higher than the maximum rating.


2. Remove power before insertion or removal of this device. 


e OPERATING CHARACTERISTICS 


[Operating characteristics table. Legible parameters: Supply Current, Output High Voltage (IOH = 1mA) VOH, Temperature, Output Capacitance, Leakage Current on any input or I/O pin. The numeric limits are not recoverable from the source.]





NCR has a license from Martin Marietta Aerospace to manufacture and market GAPP corner turn devices only for commercial and
industrial applications. GAPP corner turn devices may not be sold by NCR to the military market and may not be incorporated into equipment
for the military market without authorization from Martin Marietta Aerospace, Orlando, Florida. "Military Market" shall mean the market
defined by procurements of product made directly for the U.S. Department of Defense or any other U.S. Government agency or any foreign
governments, for use in equipment intended for military application and technically characterized for such application by construction,
extreme environment capability, electronic circuit adaptations for specifically designed military equipment, and/or being type designated by
any legally authorized government or joint government-industry body, which can confer such designations.


NCR reserves the right to make any changes or discontinue altogether without notice with respect to any hardware or software product or the
technical content herein.
® TIMING DIAGRAM


[Timing diagram, including the INPUTS waveform: the drawing is not recoverable from the source.]


NOTE: 1,2,3 refer to the staging sequence of instruction, data in and data out 


s AC CHARACTERISTICS 











[AC characteristics table: the entries are not recoverable from the source.]


NOTE: (1) d.c. by design; tested at 5 µsec.
s REGISTER GROUP AND DATA BUS IDENTIFICATION


TOP VIEW OF PACKAGE

[Figure: the array of register groups seen from the top of the package, with the west, east, north and south data bus pins and the CMS/CMN pins labelled along the edges, and the control lines shown as global connections to every element in the array.]


NOTE: This numbering scheme may be extended in systems which contain more than one GAPP corner turn device. 


PIN LABELS 
W00 - WB0 WEST DATA BUS
E05 - EB5 EAST DATA BUS
N00 - N05 NORTH DATA BUS
SB0 - SB5 SOUTH DATA BUS
CMSB0 - CMSB5 INPUT BUS
CMN00 - CMN05 OUTPUT BUS
M5 Re CONINSTRUCTION BUS 





— GLOBAL DATA INPUT 
=» BLOCK DIAGRAM OF CONNECTIONS BETWEEN
FOUR REGISTER GROUPS

[Block diagram: four register groups joined through their N/S, E/W and CMS/CMN paths, with bidirectional noninverting I/O buffers on the external data lines.]

OE = Output Enable is an internal connection. 
East Outputs enabled whenever EW:=W 
West Outputs enabled whenever EW:=E 


North Outputs enabled whenever NS:=S 
South Outputs enabled whenever NS:=N 





* SCHEMATIC DIAGRAM OF ONE REGISTER GROUP

[Schematic: multiplexors selecting the input of each of the registers; the drawing is not recoverable from the source.]

CM = Communications Register
CMS = Communications South Input
CMN = Communication North Output
® INSTRUCTION SET


MICRO-NOP
LOAD CM FROM R
MOVE FROM CMS INTO CM
LOAD 0 INTO CM

MICRO-NOP
LOAD NS FROM R
MOVE FROM N INTO NS
MOVE FROM S INTO NS
MOVE FROM EW INTO NS
MOVE FROM C INTO NS
LOAD 0 INTO NS

MICRO-NOP
LOAD EW FROM R
MOVE FROM E INTO EW
MOVE FROM W INTO EW
MOVE FROM NS INTO EW
MOVE FROM C INTO EW
LOAD 0 INTO EW

MICRO-NOP
LOAD C FROM R
MOVE FROM NS INTO C
MOVE FROM EW INTO C
LOAD 0 INTO C
LOAD 1 INTO C

MICRO-NOP
LOAD R FROM CM
LOAD R FROM C
« TABLE OF PIN NUMBERS VS. SIGNAL LABELS
(CERAMIC PACKAGE)


N.C. = No Connection Note: All VSS pins must be connected to ground.


CMNo1 
CMNo2 

No3 

CMNo4 
Test Output 
C7 

Ess 

CMNoo 


Vss 
Vss 
Vss 








=» CERAMIC PIN GRID ARRAY PACKAGE

[Package drawing. BOTTOM VIEW and SIDE VIEW; legible side-view dimensions: 0.080 ± 0.008, 0.150 ± 0.010.]
= TABLE OF PIN NUMBERS VS. SIGNAL LABELS
(PLASTIC CHIP CARRIER)


VDD


Test Output 
Cg 
Cs 


Sees ck 





NC = No Connection Note: All VSS connections must be connected to ground
® PLASTIC CHIP CARRIER PACKAGE

[Package drawing. TOP VIEW: 1.190 SQ. and 1.153 SQ. outlines, 0.045 X 45° CHAMFER, 0.010 X 45° CHAMFER, 0.576 dimension, PIN 1 INDICATOR. SIDE VIEW: 0.107 and 0.150 dimensions.]

All dimensions are in inches
= TABLE OF SIGNAL NAMES VS. PIN NUMBERS
(CERAMIC AND PLASTIC)

[Table entries are not recoverable from the source.]

Note: All VSS pins must be connected to ground


DEVICES TESTED WITH THIS
OUTPUT LOAD CONFIGURATION


DUT
All outputs





NCR Microelectronics Division 2001 Danfield Ct. Fort Collins, Colorado 80525 
Telex: 045-4505 NCRMICRO FTCN Phone: 303/226-9500 303/223-5100 
GAPP™ PC DEVELOPMENT SYSTEM


GENERAL DESCRIPTION


The GAPP PC Development System is composed of two parts. The first is a hardware board which is compatible with the
IBM-PC I/O bus and contains a 12 by 12 array of processor elements implemented with two GAPP devices. The second part
is a software package which allows the user to program the GAPP array in a high-level language and interactively debug a
program.


HARDWARE FEATURES 


12 by 12 array of GAPP Processor Elements (PE). 

12 byte reformatting/corner-turn array for data input/output. 

Interface to IBM standard bus with TTL circuitry.

Three 8-bit registers for GAPP control and address interface. 

Two I/O ports for data down-load/up-load to or from GAPP array. 

Register and port addresses are switch selected. 

PE array clock is software controlled through separate I/O port.

Printed circuit board plugs into bus connector of IBM compatible personal computers. 


Cylindrical wrap: all East and West I/O lines are horizontally connected at the left and right edges of the array; all 
North and South I/O lines are vertically connected at the top and bottom edges of the array. 
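
The effect of this edge wiring on a shift can be sketched as follows. The model is an assumption made for illustration: the grid is a list of rows from north to south, each row running from west to east, and the function names are hypothetical.

```python
def shift_east(grid):
    """Every element takes its west neighbour's value (as in EW:=W).

    With the east and west edges connected, the easternmost column
    re-enters at the west edge.
    """
    return [[row[-1]] + row[:-1] for row in grid]


def shift_south(grid):
    """Every element takes its north neighbour's value (as in NS:=N).

    With the north and south edges connected, the southernmost row
    re-enters at the north edge.
    """
    return [grid[-1]] + grid[:-1]


if __name__ == "__main__":
    grid = [[1, 2, 3],
            [4, 5, 6]]
    print(shift_east(grid))
    print(shift_south(grid))
```

On the 12 by 12 board, twelve shifts in either direction would return the data to its starting position under this model.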


SOFTWARE PACKAGE FEATURES 


Menu driven with screen oriented displays 
GAL™ (GAPP Algorithm Language) compiler. 
Simple text editor for program corrections. 


Debug routines allow user to: 
single step through GAPP instructions, 
execute an entire block of GAL program statements, 
execute entire program, 
stop at any time for program corrections/re-compilation. 


GAPP PE editor allows user to:
up-load/change/down-load contents of each PE RAM,
up-load/change/down-load contents of PE registers,
store or load any of the above data to/from a data file.
Data files can be edited with the text editor.


Runs under the VENIX/86™ (NCR45GDS1-VX) operating system on any NCR Model 4 (with hard disk) or IBM PC-XT™
compatible. Also available in an MS-DOS™ version (NCR45GDS1-MS) for IBM compatible personal computers.


NCR reserves the right to make any changes or discontinue altogether without notice with respect to any hardware or software product or the
technical content herein.


Copyright © 1985 by NCR Corporation, Dayton, Ohio, USA
FEATURES OF GAL


The GAPP Algorithm Language is a subset of the C programming language with several features added to tailor the language
to the GAPP. Features of the C programming language which have been implemented are:


All arithmetic, logical, and assignment operators.

int variables, which can contain the values from -32768 to 32767.

Variables defined inside of a block (within { }'s) are automatic (storage space can be re-used outside of the block).
The if, if ... else, for, and while program statements are implemented.


Support for subroutines and int functions is provided. Arguments to the subroutine are the values of the variables which
are used in the subroutine call. The values of these variables are not changed by the subroutine.


Additional features have been added which are unique to GAL:


A new type of variable is used to refer to GAPP RAM addresses. An image variable is used to refer to a set of adjacent
RAM locations starting at address X, with n number of bits. Image variables are declared by one of the following program
statements:

image SCRATCH : 3 : 7;

image SCRATCH : 5;
The first form defines an image named "SCRATCH" which starts at RAM address 3 and ends at RAM address 7. The second
form also defines an image named "SCRATCH," but only specifies the number of bits (5). The starting address of
the image is left up to the GAL compiler.
Image names can be used to specify the address portion of a GAPP instruction. The programmer must specify the name
of the image and an arithmetic expression which gives the offset within the image. The compiler adds the starting address
of the image to the arithmetic expression to determine the GAPP RAM address. An example is


SCRATCH : i+3


Either the image name, or the arithmetic expression may be omitted, but not both. If the expression is omitted, the compiler
uses 0 for the offset; if the image name is omitted, the compiler uses the expression for the address.
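
The address rule for images can be sketched as below. This is a model of the arithmetic only, not of the GAL compiler; the `Image` class and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Image:
    """Hypothetical stand-in for a GAL image: a run of adjacent RAM bits."""
    name: str
    start: int   # first GAPP RAM address
    nbits: int   # number of bits in the image


def declare_range(name, first, last):
    """Model of `image NAME : first : last;` (both addresses inclusive)."""
    return Image(name, first, last - first + 1)


def size(image):
    """Model of the built-in size(): number of bits in the image."""
    return image.nbits


def address(image=None, offset=None):
    """Resolve `IMAGE : expression` to a GAPP RAM address.

    Either part may be omitted, but not both: a missing expression
    means offset 0, a missing image means the expression is the address.
    """
    if image is None and offset is None:
        raise ValueError("image name and expression cannot both be omitted")
    return (image.start if image is not None else 0) + (offset or 0)


if __name__ == "__main__":
    scratch = declare_range("SCRATCH", 3, 7)
    i = 1
    print(size(scratch), address(scratch, i + 3), address(None, 9), address(scratch))
```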


The function size( ) is built into the compiler; the function accepts the name of an image as an argument and returns the
number of bits in the image.


A legal GAL program statement is a GAPP instruction made up of a GAPP RAM address and a list of GAPP assembler
mnemonics. Examples of GAPP instructions are:


X:iew: "cram: "cy; 

ram(X:4) := c;

ew := ram(:2);
In addition to using int variables as arguments to subroutine calls, the names of images may also be used. Both the starting
address and the size of the image are put on the argument stack for the subroutine to use.


The status of the Global Output pin can be used as a criterion for conditional execution of GAL program statements. This
is accomplished by the program statements: 


if(goset)

if(goclr)

if(goset) ... else

if(goclr) ... else

while(goset)

while(goclr)

for (...; goset; ...)

for (...; goclr; ...)
The term goclr is non-zero (true) if the Global Output is low (one or more NS registers contain a 1). The term goset is
non-zero (true) if the Global Output is high (all NS registers contain 0).
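
The two conditions can be modelled directly from that definition. The sketch below is an illustration in Python rather than GAL; the loop that follows mimics a while(goclr) construct, and its assumption that zeros enter at the south edge during the shift is made up for the example.

```python
def global_output(ns_registers):
    """Level of the GO pin: 0 (low) if any NS register holds 1, else 1 (high)."""
    return 0 if any(any(row) for row in ns_registers) else 1


def goclr(ns_registers):
    """True when GO is low, i.e. one or more NS registers contain a 1."""
    return global_output(ns_registers) == 0


def goset(ns_registers):
    """True when GO is high, i.e. all NS registers contain 0."""
    return global_output(ns_registers) == 1


def steps_until_clear(ns_registers):
    """Mimic `while(goclr)`: shift NS north until every NS register is 0.

    Assumption for this sketch: zeros enter at the south edge (no wrap).
    """
    rows = [list(row) for row in ns_registers]
    steps = 0
    while goclr(rows):
        rows = rows[1:] + [[0] * len(rows[0])]
        steps += 1
    return steps


if __name__ == "__main__":
    print(steps_until_clear([[0, 0], [0, 1], [0, 0]]))
```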


™GAPP and GAL are trademarks of NCR Corporation. 
™ VENIX/86 is a trademark of VenturCom, Inc. 
™ MS-DOS is a trademark of Microsoft Corp. 
=» USING THE GAPP PC DEVELOPMENT SYSTEM


a) Displaying the main menu .. . 







Copyright 1985 by NCR Corp. Dayton, Ohio, USA. All rights reserved. 


NCR GAPP PC Development System version 1.0 
GAL release 2.0. Hardware module version 1.1. 

Compile GAL program

Debug program

Execute program

GAPP RAM editor

Edit program (vi)

Initialize (clear) GAPP array


Quit 


b) Reading from GAPP RAM... 
Copyright 1985 by NCR Corp. Dayton, Ohio, USA. All rights reserved. 


NCR GAPP PC Development System version 1.0 
GAL release 2.0. Hardware module version 1.1. 


[Screen display of GAPP RAM contents: the displayed values are not legible in the source.]


GAPP RAM addresses 16 to 23 unsigned decimal display
Download Change Store to file Display mode


c) Debugging a GAPP program ...
Copyright 1985 by NCR Corp. Dayton, Ohio, USA. All rights reserved.


NCR GAPP PC Development System version 1.0
GAL release 2.0. Hardware module version 1.1.


[Debugger screen showing the current GAPP instruction and a command line for stepping, block and program execution, editing (vi) and abort: the text is not legible in the source.]


d) Flagging a program error...


"edge.gal"
line 19: syntax error.
line 23: syntax error.


Hit RETURN to continue


e) Entering editor to correct program...


[Editor screen showing the GAL source file "edge.gal"; the listing is not legible in the source. Legible fragments include a for loop over i and the comments "Get bit from operand" and "add bits and store in result".]


"edge.gal" 24 lines, 378 characters
GAPP™ SIMULATOR/ASSEMBLER


s GENERAL DESCRIPTION


The GAPP Simulator/Assembler package is composed of two utilities which operate under the UNIX™ (NCR45GS2-UX)
or VAX/VMS™ (NCR45GS2-VM) operating systems. The first utility is the assembler, GAPASM, which translates GAPP
instruction mnemonics and address specifications into binary object code suitable for down-loading into a control state.
The simulator, GAPSIM, is an interactive package which enables the user to "execute" GAPP programs, and view the contents
of GAPP RAM and the state of the Processor Element registers in order to verify and/or debug the program.


™GAPP is a trademark of NCR Corporation 
™UNIX is a trademark of AT&T Bell Laboratories 
™VAX and VMS are trademarks of Digital Equipment Corp. 


# ASSEMBLER 


The assembler is invoked with a command of the form:
gapasm [ -o outfile ] [ filename ]

where "outfile" is the output filename and "filename" is the input filename. The square brackets ( [ ]'s ) indicate that both
the input and output filenames are optional. If used, the input filename must end with ".asm", and if the input filename is
not given, the standard input is assumed (terminal console, or whatever was specified using the "<" re-direction feature of
UNIX). If the standard input is the console, then the user may type GAPP assembler instructions and the assembler will
interactively return the GAPP object code values. If the output filename is not given, the assembler output goes to a file of
the same name as the input filename, but ending with ".gap". The "-o" flag can be used to direct the output to any filename,
which does not have to end with ".gap".
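
The filename rules can be summarised in a short sketch. This models only the naming convention described above, not GAPASM itself; the function name is hypothetical, and returning None when the input comes from standard input without "-o" is an assumption, since the text does not name an output file for that case.

```python
def output_name(input_name=None, outfile=None):
    """Pick the assembler output filename from the command-line arguments.

    input_name: the optional input filename (must end with ".asm").
    outfile:    the optional argument of the "-o" flag.
    """
    if input_name is not None and not input_name.endswith(".asm"):
        raise ValueError('input filename must end with ".asm"')
    if outfile is not None:
        return outfile                      # "-o" may name any file
    if input_name is None:
        return None                         # standard input (assumption: no file)
    return input_name[: -len(".asm")] + ".gap"


if __name__ == "__main__":
    print(output_name("edge.asm"))
    print(output_name("edge.asm", "edge.obj"))
```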


The assembler assumes an input format for GAPP assembly instructions: 

® GAPP RAM address must start in column 0, and be given in hexadecimal notation (allowed values are 0 to 7F hex).
® Spaces or tabs are used to separate the RAM address from the GAPP instruction field.

® A GAPP instruction field is composed of up to five micro-instruction fields, each separated by spaces or tabs.


® A GAPP micro-instruction field is composed of a destination mnemonic and a source mnemonic, separated by '=' or
':='. Valid destination and source mnemonics are: ew, ns, cm, c, and ram. Additional valid source mnemonics are: n,
s, e, w, cy, carry, bw, borrow, sm, and plus. cy and carry have the same meaning; the same is true for bw and borrow,
and sm and plus. Not all combinations of source and destination mnemonics are legal GAPP instructions; see the GAPP
(NCR45CG72) data sheet for details.


® Comments start with a semicolon (;) and continue to the end of the line.


The assembler checks for the following errors: 

® Invalid destination or source mnemonics.

® Invalid combination of source and destination mnemonics (illegal GAPP instruction).
® Conflicting micro-instruction mnemonics (e.g. ns=ew ns=ram).

® Invalid GAPP RAM addresses.


Any errors cause an error message, but processing of the input file continues. GAPASM exits with a return code of 0 if no
errors occurred, otherwise it exits with a return code of 1. Any error messages go to the standard error output.
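
The input format and most of the checks listed above can be sketched as a small line parser. This is an illustration, not GAPASM: the function name, the error wording and the acceptance of the constants 0 and 1 as sources (as seen in listings such as "c:=0") are assumptions, and the check for illegal source/destination combinations is omitted because it depends on the NCR45CG72 instruction table.

```python
import re

DESTINATIONS = {"ew", "ns", "cm", "c", "ram"}
SOURCES = DESTINATIONS | {"n", "s", "e", "w", "cy", "carry", "bw", "borrow", "sm", "plus"}
CONSTANTS = {"0", "1"}   # assumption: constants as used in "c:=0"
FIELD = re.compile(r"^([a-z]+)(?::=|=)([a-z0-9]+)$")


def parse_line(line):
    """Parse one assembly line into (address, [(destination, source), ...]).

    Returns None for blank or comment-only lines; raises ValueError on
    the errors described in the text.
    """
    code = line.split(";", 1)[0].rstrip()        # comments run to end of line
    if not code:
        return None
    if code[0].isspace():
        raise ValueError("RAM address must start in column 0")
    tokens = code.split()                        # spaces or tabs separate fields
    address = int(tokens[0], 16)                 # hexadecimal; ValueError if malformed
    if not 0 <= address <= 0x7F:
        raise ValueError("invalid GAPP RAM address")
    if len(tokens) - 1 > 5:
        raise ValueError("more than five micro-instruction fields")
    ops, seen = [], set()
    for field in tokens[1:]:
        match = FIELD.match(field.lower())
        if match is None:
            raise ValueError(f"malformed micro-instruction: {field}")
        dest, src = match.groups()
        if dest not in DESTINATIONS or src not in SOURCES | CONSTANTS:
            raise ValueError(f"invalid mnemonic in: {field}")
        if dest in seen:
            raise ValueError(f"conflicting micro-instructions for: {dest}")
        seen.add(dest)
        ops.append((dest, src))
    return address, ops


if __name__ == "__main__":
    print(parse_line("8 ram:=sm c:=cy ; store the sum bit"))
```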


NCR reserves the right to make any changes or discontinue altogether without notice with respect to any hardware or software product or the 
technical content herein. 
No filename of image to load - Occurs during a "store" command. No filename was given with the "store" command.


Invalid stack frame to load - Occurs during a "load" command. The RAM frame was not given with the "load" command,
or it was not between 0 and 15.


File write error - Occurs during a "store" command. The command could not be completed because of a system I/O error.
The system error message is printed. See the appropriate operating system reference manual for explanation of the
system error message.


File read error - Occurs during a "load" or "do" command. The command could not be completed because of a system I/O
error. The system error message is printed. See the appropriate operating system reference manual for explanation of
the system error message.


No filename to perform - Occurs during a "do" command. No filename was given with the "do" command.


Invalid index into stack - Occurs with the "rgrid" command. The RAM frame number was not given with the "rgrid" command
or was not between 0 and 15.


myshel: fork failed - An error occurred while trying to execute a UNIX command entered using the " | " feature. See the
"fork(2)" entry in the UNIX Reference Manual for explanation.


Other error messages not listed above are the result of UNIX or VAX/VMS system errors, and an appropriate system error
message will be printed. The operating system reference manuals should be consulted.
NCR45GDS1

Main Menu

C - Compile GAL™ Program
D - Debug/Execute GAL Program
P - Edit Program -> Editor (Program File)
I - Initialize (Clear) GAPP Array
S - System Configuration
X - Temporary Exit to Execute System Command
Q - Exit GAPSYS™


Configuration Menu

E - Specify Editor Pathname
I - Specify Maximum Number of Instruction Cycles (Designed Timeout) for GAL Program
A - Add File to Subroutine Library List
D - Delete File from Subroutine Library List
Q - Return to Main Menu


Debug Sub-menu

G - Execute Single GAPP Instruction
B - Execute Program to Breakpoint
F - Execute to End of Program
P - Edit Program -> Editor (Program File)
U - Upload Data from GAPP RAM
R - Upload Data from GAPP Register
L - Load Data from File
D - Edit Data File -> Editor (Data File)
Q - Return to Main Menu


Register Select

C - Display C Register
M - Display CM Register
N - Display NS Register
E - Display EW Register
A - Display All Registers


Display Data Sub-menu

G - Execute Single GAPP Instruction
B - Execute Program to Breakpoint
F - Execute to End of Program
P - Edit Program -> Editor (Program File)
H,J,K,L - Cursor Movement within Data Display
D - Download Data to GAPP Array
S - Store Data to File (Specify Filename)
C - Change Number Base (Hex or Decimal)
Q - Return to Debug Menu


GAPSYS Interactive Menu Structure
[Block diagram. Blocks shown: IBM PC I/O DATA BUS; CONTROL REGISTER; ADDRESS REGISTER; WRITE DATA REGISTER; READ DATA; CORNER TURN LINE BUFFER (2 NCR45CT6 CHIPS); GAPP™ PROCESSOR ARRAY (2 GAPP CHIPS) with E/W WRAP AROUND and N/S WRAP AROUND.]


BLOCK DIAGRAM. GAPP PC DEVELOPMENT SYSTEM
s USING THE GAPP SIMULATOR/ASSEMBLER


% gapsim


Copyright 1985 NCR Corp. Dayton, Ohio, USA.
All rights reserved


GAPP Simulator version 1.0 System: UNIX
Array size: 8 by 8


. load demo.pic 0
. rgrid 0


[Displayed 8 by 8 grid of RAM frame 0 values: not legible in the source.]


To enter the GAPP Simulator/Assembler, the user types
the command "gapsim". Upon entering the Simulator a
message is typed to inform the user of the current array
size (in this case the default size 8 by 8 is returned). A
period "." is the command line prompt within the Simulator.
In the above figure the user loads data from a file
named "demo.pic" and stores it in GAPP RAM frame 0
(RAM locations 0-7). The user then displays the contents
of RAM frame 0 using the "rgrid" command.
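
The frame numbering used by the simulator can be sketched as simple arithmetic. The sketch assumes eight RAM locations per frame, which matches the examples in the text (frame 0 is locations 0-7, frame 3 is locations 24-31) and the stated frame range of 0 to 15; the function name is hypothetical.

```python
FRAME_SIZE = 8     # RAM locations per frame, inferred from the examples
FRAME_COUNT = 16   # frames 0 to 15


def frame_range(frame):
    """Return (first, last) GAPP RAM locations covered by a RAM frame."""
    if not 0 <= frame < FRAME_COUNT:
        raise ValueError("RAM frame must be between 0 and 15")
    first = frame * FRAME_SIZE
    return first, first + FRAME_SIZE - 1


if __name__ == "__main__":
    for frame in (0, 3, 15):
        print(frame, frame_range(frame))
```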


[Simulator screen showing the "pgrid" display of the processor element registers: the displayed values are not legible in the source.]


Executing a program "thresh.do" on input data stored in
GAPP RAM frame 5. When the program has completed
execution, the user inspects the contents of data registers
within each processor element in the GAPP array using
the "pgrid" command. Here we see that certain processor
elements have their "C" registers set to "1" while
other processor elements' "C" registers contain "0".
All other registers ("CM", "NS", and "EW") contain logic
value 0 in all processor elements.
0    c:=0
0    ew:=ram
0    ew:=w ns:=ram
8    ram:=sm c:=cy
1    ew:=ram
1    ew:=w ns:=ram
9    ram:=sm c:=cy
2    ew:=ram
2    ew:=w ns:=ram
a    ram:=sm c:=cy
3    ew:=ram
3    ew:=w ns:=ram
b    ram:=sm c:=cy
4    ew:=ram
4    ew:=w ns:=ram
c    ram:=sm c:=cy
d    ram:=c c:=0
8    ew:=ram
8    ew:=e ns:=ram
0    ram:=sm c:=cy
9    ew:=ram
9    ew:=e ns:=ram
1    ram:=sm c:=cy
a    ew:=ram


An example of an assembly program ("horiz_edge.do")
run on the GAPP simulator, written entirely in GAPP
mnemonics.
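
The first part of the listing repeats a three-instruction pattern (ew:=ram, then ew:=w ns:=ram, then ram:=sm c:=cy) once per bit, which amounts to a bit-serial addition of each element's word to its west neighbour's word. The sketch below models that pattern for one row of processor elements. It is not the simulator: the function names are hypothetical, the adder is assumed to be a conventional full adder, and the westmost element is assumed to read 0 from its west input.

```python
def full_add(ns, ew, c):
    """Assumed one-bit full adder: returns (sum, carry)."""
    return ns ^ ew ^ c, (ns & ew) | (ns & c) | (ew & c)


def add_west_neighbour(values, nbits=5, in_base=0, out_base=8):
    """Bit-serial add of each element's word to its west neighbour's word.

    values: one row of unsigned words, west to east, nbits wide each.
    Input bits live at RAM addresses in_base.., results at out_base..,
    with the final carry stored one address above the last sum bit.
    """
    n = len(values)
    ram = [[0] * 128 for _ in range(n)]               # 128 x 1 bit RAM per element
    for i, value in enumerate(values):
        for k in range(nbits):
            ram[i][in_base + k] = (value >> k) & 1
    c = [0] * n                                       # c:=0
    for k in range(nbits):
        ew = [ram[i][in_base + k] for i in range(n)]  # ew:=ram
        ew = [0] + ew[:-1]                            # ew:=w (west edge reads 0, assumption)
        ns = [ram[i][in_base + k] for i in range(n)]  # ns:=ram
        results = [full_add(ns[i], ew[i], c[i]) for i in range(n)]
        for i, (sm, _) in enumerate(results):
            ram[i][out_base + k] = sm                 # ram:=sm
        c = [cy for _, cy in results]                 # c:=cy
    for i in range(n):
        ram[i][out_base + nbits] = c[i]               # ram:=c
    return [
        sum(ram[i][out_base + k] << k for k in range(nbits + 1))
        for i in range(n)
    ]


if __name__ == "__main__":
    print(add_west_neighbour([3, 5, 31, 1]))
```

All elements execute the same instruction on the same bit position at each step, so the run time depends on the word length rather than on the number of elements in the row.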
. load demo.pic 0
. rgrid 0
. do horiz_edge
. rgrid 3

[Displayed grid values are not legible in the source.]


Executing "horiz_edge.do" using the "do" command.
Input data from file "demo.pic" is stored in GAPP RAM
frame 0.


Displaying the results of "horiz_edge.do". The output
data is stored in GAPP RAM frame 3 (RAM locations
24-31).


. xsize 10
. ysize 10
. xsize

Array size: 10 by 10.
. rgrid 6

[Displayed grid values are not legible in the source.]


Changing the GAPP processor array size from 8 by 8 to
10 by 10 using the "xsize" and "ysize" commands. The
user exits the GAPP Simulator/Assembler using the
"bye" command.


NCR Microelectronics Division 2001 Danfield Ct. Fort Collins, Colorado 80525-2998
Telex: 045-4505 NCRMICRO FTCN Phone: 303/226-9500 303/223-5100





CAPPIX II


COMPUPIX ARITHMETIC PARALLEL PROCESSOR


The CompuPix CAPPIX II is an IBM-PC BUS Compatible Processor Board with up to 288 CPU's,
36864 x 1 BIT RAM, 1152 Single Bit Latches, and 8K Bytes of Data RAM.


FEATURES:
- Single Printed Circuit Board
- 144 (expandable to 288) Processors
- Interfaces with IBM-PC and IBM-PC Compatible Systems
- Operates on UNIX* Compatible and MS-DOS Operating Systems
- 90 Day Warranty
- 4K (expandable to 8K) Bytes of Data RAM
- 36864 x 1 BIT RAM
- Data RAM Mapped into main memory


ARCHITECTURE:
The CAPPIX II is memory mapped. The Data RAM is mapped in 4K Byte segments and can be located to meet the
user requirements. The CAPPIX processor IC's each house a CMOS array of 72 processors. Each processor is composed
of a bit serial ALU, 128 x 1 Bit RAM, and 4 single Bit latches. Three latches hold inputs to the ALU and the fourth latch
allows I/O through the cell without interrupting the ALU. I/O operations are overlapped with computation.

UNIX® is a trade and service mark of Be!! Laboratories 
MS-DOS 's a registered trade and service mark of Microsoft Corporation 
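
The board totals quoted above follow from the per-processor figures (72 processors per IC, 128 bits of RAM and 4 latches per processor). The short Python sketch below only checks that arithmetic; the function name `board_totals` is illustrative and is not part of any CompuPix software.

```python
# Per-processor figures taken from the ARCHITECTURE paragraph.
PROCESSORS_PER_IC = 72
RAM_BITS_PER_PROCESSOR = 128
LATCHES_PER_PROCESSOR = 4


def board_totals(processors):
    """Return IC count, total RAM bits and total latches for a board."""
    if processors % PROCESSORS_PER_IC:
        raise ValueError("processor count must be a multiple of 72")
    return {
        "ics": processors // PROCESSORS_PER_IC,
        "ram_bits": processors * RAM_BITS_PER_PROCESSOR,
        "latches": processors * LATCHES_PER_PROCESSOR,
    }


if __name__ == "__main__":
    for count in (144, 288):
        print(count, board_totals(count))
```

For the fully populated board (288 processors) this gives 4 ICs, 36864 RAM bits and 1152 latches, matching the figures in the summary line.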
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= Telephone: Maidenhead (0628) 75851 Telex: 847898 MANSKY Facsimile: (0628) 782812 
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GENERAL DESCRIPTION 





THE COMPUPIX CAPPIX II PARALLEL PROCESSOR IS A PROGRAMMABLE GENERAL
PROCESSING BOARD, CONTAINING ITS OWN ON-BOARD CONTROLLER AND UP TO 
8K BYTES OF SINGLE PORT SRAM MEMORY. BY USING SINGLE PORT RAM's, DATA 
CAN BE: 1) WRITTEN INTO BY THE HOST COMPUTER, 2) READ INTO THE HOST 
COMPUTER, 3) WRITTEN INTO BY THE CAPPIX PROCESSOR, OR 4) READ INTO THE 
CAPPIX PROCESSOR. 


THE CAPPIX II BOARD IS EASILY PROGRAMMED, AND IS SUITABLE FOR PATTERN
RECOGNITION, PARALLEL PROCESSING, AND IMAGE PROCESSING. 


SPECIFICATIONS
DIMENSIONS:       13.2 inch (33.5 cm) x 4.2 inch (10.7 cm) printed circuit board
DATA RAM:         4K (Expandable to 8K) Words
CYCLE TIME:       45 Nanoseconds
SHIPPING WEIGHT:  2.5 lbs (1.14 kg) including board and documentation
POWER:            Power supply 5 VDC ± 5%
                  Current 2.5 amp typical, 3.5 amp maximum
TEMPERATURE:      Operating: 0 degrees C to 55 degrees C
                  Shipping: -55 degrees C to +55 degrees C


INSTALLATION 


EACH CAPPIX II BOARD IS SHIPPED WITH A DETAILED INSTALLATION AND
INSTRUCTION MANUAL. INSTALLATION NORMALLY REQUIRES LESS THAN 10 
MINUTES. 





WARRANTY 


ALL CAPPIX II BOARDS ARE WARRANTED AGAINST DEFECTS IN MATERIALS OR
WORKMANSHIP FOR 90 DAYS AFTER SHIPMENT DATE. DEFECTIVE BOARDS 
COVERED BY THIS WARRANTY SHALL BE RETURNED TO COMPUPIX PREPAID AFTER 
CONTACTING COMPUPIX FOR A RETURN MATERIALS AUTHORIZATION NUMBER. 
COMPUPIX PROVIDES 48 HOURS TURN AROUND OF REPLACED OR REPAIRED 
BOARDS TO THE PURCHASER. 
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NCR45CT6 
GAPP™ APPLICATION NOTE NO. 1


INPUT/OUTPUT OF REAL TIME VIDEO DATA 
USING THE NCR45CT6 


■ INTRODUCTION


This application note describes how the NCR45CT6 is
used to reformat real time video data for an array of
NCR45CG72 GAPP™ (Geometric Arithmetic Parallel
Processor) devices. The NCR45CT6 devices are cascaded
to create the "Corner Turn Line Buffer".


The Corner Turn Line Buffer performs two functions: 
loading data into the GAPP array, and unloading results 
from the GAPP array. 


Loading Data into the GAPP Array: 


Video data is received from an analog-to-digital
converter one pixel at a time; usually, each pixel is
represented by a group of bits (for illustrative
purposes, these "groups" of bits will be referred to as
"columns" of bits). Once an entire line of pixels is
received, the Corner Turn Line Buffer holds a
column of bits for every pixel in the line. Now, the
columns are shifted into the GAPP array one row at a
time. For example, the most-significant bit of each
column is shifted, followed by the
second-most-significant bit, etc., until all of the bits
have been shifted into the GAPP array.


Unloading Data from the GAPP Array: 


Unloading data from the GAPP array is simply a
matter of reconstructing the "columns" of bits for
each pixel in a line. The resultant rows for each video
line are shifted into the Corner Turn Line Buffer
where each bit is placed in the appropriate pixel
column. Finally, the columns are shifted out of the line
buffer one column (or pixel) at a time into a
digital-to-analog converter, where a signal is produced
which may be sent directly to a video monitor.
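
The load and unload steps described above amount to a transpose between "one pixel at a time" and "one bit-row at a time". The following Python sketch models only that data reordering, not the NCR45CT6 hardware; the function names are illustrative, and the MSB-first row order follows the "for example" in the text above.

```python
def corner_turn_load(pixels, bits_per_pixel=6):
    """Turn a line of pixel values into bit rows, MSB row first.

    Row i holds bit (bits_per_pixel - 1 - i) of every pixel in the line,
    which is the order in which rows would be shifted into the array.
    """
    return [
        [(p >> b) & 1 for p in pixels]
        for b in range(bits_per_pixel - 1, -1, -1)
    ]


def corner_turn_unload(rows):
    """Rebuild pixel values from bit rows received MSB row first."""
    pixels = [0] * len(rows[0])
    for row in rows:
        # Each new row supplies the next lower bit of every pixel.
        pixels = [(p << 1) | bit for p, bit in zip(pixels, row)]
    return pixels


if __name__ == "__main__":
    line = [0, 1, 32, 63, 21]
    rows = corner_turn_load(line)
    print(rows)
    print(corner_turn_unload(rows))
```

Loading followed by unloading returns the original line, which is the property the line buffer relies on.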


■ IMPLEMENTATION


The following example is provided to illustrate an actual 
implementation of the Corner Turn Line Buffer. 


For this example, a window, or partial frame, of video
data is defined as 48 lines; each line contains 48 pixels
(i.e. a window is an array of 48-by-48 pixels). Each pixel
is represented by six bits of data.


Since each NCR45CT6 device contains an array of
six-by-twelve processing elements, only one row of
NCR45CT6 devices is required to hold a line of video data
(see Figure 1). Two rows of NCR45CT6 devices may be
used to hold a line of twelve bit per pixel data, or a line
of eight bit per pixel data with an additional four bit
planes available for graphic or text overlay. Four rows of
NCR45CT6 devices may be used as a twenty-four bit per
pixel line buffer (3 colors, 8 bits per color).
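
The device counts in the paragraph above can be checked with a few lines of Python. This is only a sizing sketch under the stated geometry (six rows by twelve columns of processing elements per device); the function names are illustrative.

```python
import math

PE_ROWS_PER_DEVICE = 6    # bits of depth held by one row of devices
PE_COLS_PER_DEVICE = 12   # pixels held by one device


def line_buffer_devices(pixels_per_line, bits_per_pixel):
    """Return (devices per row, rows of devices, total devices)."""
    per_row = math.ceil(pixels_per_line / PE_COLS_PER_DEVICE)
    rows = math.ceil(bits_per_pixel / PE_ROWS_PER_DEVICE)
    return per_row, rows, per_row * rows


def array_devices(lines, pixels_per_line):
    """Number of 6x12 devices needed for one PE per pixel."""
    return math.ceil(lines * pixels_per_line
                     / (PE_ROWS_PER_DEVICE * PE_COLS_PER_DEVICE))


if __name__ == "__main__":
    for depth in (6, 12, 24):
        print(depth, line_buffer_devices(48, depth))
    print(array_devices(48, 48))
```

For the 48-pixel, 6-bit example this gives four devices in one row, and a 48-by-48 processing array needs 32 devices.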





[Figure 1 diagram not reproduced. Labels: VIDEO IN; 6-BIT A-TO-D CONVERTER; four NCR45CT6 devices with N, S, E and W ports; CORNER TURN LINE BUFFER.]


Figure 1. Four NCR45CT6 devices connected as a 48 pixel long corner turn line buffer; each pixel is represented by 6 bits.
The pixel data is shifted through the NS registers from the north to the south.


Table 1. Instruction Sequence Used by the Line Buffer to Load Data From A-to-D and Unload to D-to-A


Line Buffer
Instruction Queue    Description of Actions

NS:=N                Shift in 1st pixel of line from analog-to-digital converter.
NS:=N                Shift in 2nd pixel of line from analog-to-digital converter.
NS:=N                Shift in 3rd pixel of line from analog-to-digital converter.
  .
  .
  .
NS:=N                Shift in 47th pixel of line from analog-to-digital converter.
NS:=N                Shift in last pixel of line from analog-to-digital converter.
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The flow of the data from the analog Video In signal to
the analog Video Out signal is described below:


1. The analog Video In signal is digitized to create a
six-bit digital representation for each pixel.


2. The video data is shifted into the NCR45CT6 devices
by executing the sequence of instructions in Table 1.
Each NS:=N instruction shifts in one pixel from
the analog-to-digital converter.

Since each NCR45CT6 device contains a six-by-twelve
array of processors (six rows and twelve columns),
the number of devices required to implement
the six bit deep line buffer is one-twelfth the
number of pixels per line.


Figure 2 illustrates the two dimensional 48-by-48
GAPP array with one processor element (PE) per
pixel. The six bits per pixel are stored in six RAM
locations within each PE.
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Figure 2. Corner Turn Line Buffer interface to a 48x48 array of PEs (32 GAPP ICs, each containing a 6x12 array of PEs).
Image frames output from the CMN bus of the GAPP array are connected to the east inputs of the NCR45CT6
devices in the video line buffer for conversion to a video line output of 6 bits/pixel.


*R: 10 kΩ pull-up resistor on each CMS input prevents the inputs from floating when the west outputs of the corner
turn line buffer devices are at Hi-Z.


NCR reserves the right to make any changes or discontinue altogether without notice with respect to any hardware or software product or the 
technical content herein. 
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During the horizontal retrace interval, the video line
is shifted into the processing array by executing the
sequence of instructions listed in Table 2. The first
instruction does a "corner turning" operation by
shifting the data from the NS register into the EW
register of the NCR45CT6 devices. Simultaneously,
the first bit of each pixel in the GAPP array is
fetched from RAM with the CM:=RAM(0) instruction.
The next three instructions are repeated six times:
once for each bit in the pixel. As each of the six bits
per pixel is clocked out of the video line buffer
(using an EW:=E instruction), it is read into
the CM register of the bottom row of PEs of the
GAPP processor array (using a CM:=CMS instruction).
This CM:=CMS instruction also causes the
corresponding bit of each row in the processing
array to be shifted up one row. Next, the data is
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saved in the appropriate RAM location while the 
corner turn buffer performs a NOP instruction. 
Finally, the next bit for each row in the GAPP 
array is fetched from RAM to prepare for the next 
shift, while the corner turn buffer executes another 
NOP instruction. 


After the entire video window has been loaded into
the processor array, computations may begin. For
some algorithms it is necessary to load two or more
sequential frames into the GAPP array (each frame
is stored in different RAM locations).
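
The load loop described above (fetch a bit plane into CM, shift it up one row while the new line enters at the south edge, store it back to RAM) can be modelled in a few lines of Python. This is a behavioural sketch only: the array is a nested list, `ram[b]` stands for RAM location b across all PEs with row 0 at the north edge, and the function names are illustrative. It also collects the row that leaves the north edge, which is how a previous result is unloaded while a new window is loaded.

```python
def make_array(rows, cols, bits=6):
    """RAM model: ram[b][r][c] is bit b stored in the PE at row r, column c."""
    return [[[0] * cols for _ in range(rows)] for _ in range(bits)]


def load_line(ram, line, bits=6):
    """Shift one video line into the south edge; return the line leaving the north edge."""
    out = [0] * len(line)
    for b in range(bits):                       # LSB first
        cm = ram[b]                             # CM:=RAM(b)
        north = cm[0]                           # row that falls off the top
        new_row = [(p >> b) & 1 for p in line]  # bit b of the incoming line
        cm = cm[1:] + [new_row]                 # CM:=CMS, every row moves up one
        ram[b] = cm                             # RAM(b):=CM
        out = [o | (bit << b) for o, bit in zip(out, north)]
    return out


def read_frame(ram, bits=6):
    """Reassemble pixel values from the bit planes held in RAM."""
    rows, cols = len(ram[0]), len(ram[0][0])
    return [
        [sum(ram[b][r][c] << b for b in range(bits)) for c in range(cols)]
        for r in range(rows)
    ]


if __name__ == "__main__":
    frame_a = [[(4 * r + c) % 64 for c in range(4)] for r in range(4)]
    frame_b = [[(63 - 4 * r - c) % 64 for c in range(4)] for r in range(4)]
    ram = make_array(4, 4)
    for line in frame_a:
        load_line(ram, line)
    unloaded = [load_line(ram, line) for line in frame_b]
    print(unloaded == frame_a, read_frame(ram) == frame_b)
```

After as many lines as the array has rows, the first line loaded sits in the top row; loading a second frame pushes the first one out of the north edge line by line.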


The result of a computation consists of a video
window stored in RAM. The result is output to the line
buffer using the sequence of instructions listed in
Table 2. If the computed result consists of 6 bits per
pixel and resides in RAM locations 0 through 5, it


Table 2. Instruction Sequence Used to Load (Unload) Data To (From) GAPP Array.


Line Buffer  GAPP Array
Instruction  Instruction   Description of Actions

EW:=NS       CM:=RAM(0)    Move new-line data into EW of Line Buffer and load CM of array
                           with LSB of previous lines.
EW:=E        CM:=CMS       Shift LSB of new line into south end of array while shifting
                           previous line LSB's up 1 row.
NOP          RAM(0):=CM    Store LSB into RAM location 0.
NOP          CM:=RAM(1)    Load 2nd LSB of previous lines into CM of array to prepare for
                           next shift.
EW:=E        CM:=CMS       Shift 2nd LSB of new line into south end of array while shifting
                           2nd LSB's of previous lines up 1 row.
NOP          RAM(1):=CM    Store 2nd LSB into RAM location 1.
NOP          CM:=RAM(2)    Load 3rd LSB of previous lines into CM of array to prepare for
                           next shift.
EW:=E        CM:=CMS       Shift 3rd LSB of new line into south end of array while shifting
                           3rd LSB's of previous lines up 1 row.
NOP          RAM(2):=CM    Store 3rd LSB into RAM location 2.
NOP          CM:=RAM(3)    Load 3rd MSB of previous lines into CM of array to prepare for
                           next shift.
EW:=E        CM:=CMS       Shift 3rd MSB of new line into south end of array while shifting
                           3rd MSB's of previous lines up 1 row.
NOP          RAM(3):=CM    Store 3rd MSB into RAM location 3.
NOP          CM:=RAM(4)    Load 2nd MSB of previous lines into CM of array to prepare for
                           next shift.
EW:=E        CM:=CMS       Shift 2nd MSB of new line into south end of array while shifting
                           2nd MSB's of previous lines up 1 row.
NOP          RAM(4):=CM    Store 2nd MSB into RAM location 4.
NOP          CM:=RAM(5)    Load MSB of previous lines into CM of array to prepare for
                           next shift.
EW:=E        CM:=CMS       Shift MSB of new line into south end of array while shifting
                           MSB's of previous lines up 1 row.
NS:=EW       RAM(5):=CM    Store MSB into RAM location 5 and place the output data which has
                           been shifted into the Line Buffer's EW from the top of the
                           processing array into the NS.
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may be output simultaneously with the loading of a 
new video window. With the connections as shown 
in Figure 2, the output on the CMN bus from the 
north edge of the GAPP array is wrapped around 
and shifted into the east input of the line buffer. 


6. The output of each video line into the line buffer
using the instruction sequence in Table 2 spreads
the 6 bits per pixel into 6 adjacent EW registers in
the line buffer. The last instruction in Table 2
performs the corner turning operation by placing the
contents of the EW registers into the NS registers.
Then during the input of the next video line from
the analog-to-digital converter, the resultant video
line (now in the NS registers) is shifted out to the
south (into the digital-to-analog converter) by
executing the sequence of instructions listed in Table
1. The 6 bits per pixel are output directly to a
digital-to-analog converter which may provide a video
signal directly to a display monitor.


■ SAMPLE APPLICATION


The following example is provided to demonstrate how 
this corner turning scheme is implemented for a specific 
application. 


For this example, Frame A is originally loaded into
RAM locations 00 through 05 of the GAPP array.
During the vertical retrace period Frame A is processed and
the result (Frame A') replaces it in RAM locations 00
through 05. Then while Frame B is being loaded into the
south edge of the GAPP array from the Corner Turn
Line Buffer, Frame A' is being unloaded from the north
edge of the array into the Corner Turn Line Buffer.


In Figure 3, line 2 of Frame B has just been loaded into
the bottom row of PEs in the GAPP array and line 2 of
Frame A' has just been output from the top row of the
PEs in the GAPP array into the EW registers of the line
buffer. Next, line 2 of Frame A' is transferred to the NS
registers of the line buffer with the operation NS:=EW.
Now as line 3 of Frame B is shifted into the line buffer,
line 2 of Frame A' is simultaneously shifted out
from the line buffer into the digital-to-analog converter.


For many applications requiring multiple frames, a more
sophisticated scheme is used. Pipelining of frames may
be required to obtain desired throughput. Another
scheme might utilize special features of a hardware
controller or the GAPP Language compiler to allow
execution of microcode in the GAPP array while data is
being loaded into the array (i.e. during a portion of the
horizontal retrace period as well as the vertical retrace
period).


[Figure 3 diagram not reproduced. Labels: CORNER TURN LINE BUFFER with EW REGISTER PLANE and NS REGISTER PLANE; PROCESSOR ARRAY holding Frame A' and Frame B in RAM 0; EW:=NS; Output Line 3 of Frame A'; Output Line 2 of Frame A' (to D/A converter); Input Line 3 of Frame B (from A/D converter).]


Figure 3. Illustration of one type of data flow in the GAPP system. Data is input to the NS registers from the right edge of
the corner turn line buffer, transferred to the EW registers, shifted into the GAPP processor array on the CM bus,
and downloaded into RAM. Result data from RAM is uploaded into the CM register, shifted out of the array and
into the EW registers of the corner turn line buffer; finally, it is transferred to the NS registers and shifted out the
left end of the corner turn line buffer.
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GAPP APPLICATION NOTE NO. 2 


GAPP MEMORY EXPANSION 


Although the Geometric Arithmetic Parallel Processor,
or GAPP chip, contains 128 bits of RAM per processing
element, it is occasionally desirable to expand the
memory using external static RAM to provide capability of
storing additional image data or to perform numerical
computations that require more memory space.


Figure A shows one possible memory expansion
technique which utilizes six n x 1 bit RAM chips per GAPP
device in the system. Figure A shows one RAM chip
connected to each of the six CMN lines on a GAPP
device. This configuration is preferred when additional
memory requirements are on the order of 2K or more
bits per processor element. For example, using six 64K
by 1 RAM chips would provide 5461 bits of RAM per
processing element (64K divided by 12 is 5461 leaving a
total of 4 bits of RAM unused).


Figure B depicts a memory expansion technique utilizing
a single 8 bit wide RAM. This configuration is used when
less than 2K of RAM per processing element is required
because it minimizes the amount of hardware required
(i.e. it only requires one RAM chip). For example, 682
bits of RAM per processor can be provided by a single
8K x 8 RAM chip.
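
Both capacity figures come from dividing the number of external RAM addresses by the twelve processing elements that share each CMN line. The sketch below reproduces that division; the function name is illustrative.

```python
PES_PER_CMN_LINE = 12  # one column of a GAPP device, as in "64K divided by 12"


def expansion_bits_per_pe(ram_addresses):
    """Return (bits of external RAM per PE, addresses left unused)."""
    return divmod(ram_addresses, PES_PER_CMN_LINE)


if __name__ == "__main__":
    print(expansion_bits_per_pe(64 * 1024))  # six 64K x 1 chips (Figure A)
    print(expansion_bits_per_pe(8 * 1024))   # one 8K x 8 chip (Figure B)
```

The 64K case gives 5461 bits per processing element with 4 addresses unused, and the 8K case gives 682 bits per processing element.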


In both configurations the GAPP uses the CM bus so
that data transfer to RAM has minimum impact on
algorithm execution time. It takes 12 clock cycles to shift a
plane of data across the CM plane and one cycle to
transfer the single bit plane of data to RAM inside the GAPP.
Thus the time to transfer 8 bits of data between GAPP
RAM and external RAM is 96 data shift instructions plus
8 GAPP RAM data operations. Because of the GAPP
chip's unique architecture, processor operations within
the GAPP array need only be interrupted during the 8
GAPP RAM operations. Thus, for every RAM operation
there are 12 CM shift operations that can execute
concurrently with program execution. If the user's
application can be processed in a pipelined fashion, the loading
of new data can take place concurrently with program
execution on previously loaded data.
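
The instruction counts quoted above generalise to any number of bit planes. The following sketch simply restates that arithmetic (12 CM shifts plus one GAPP RAM operation per plane); the function name and the dictionary keys are illustrative.

```python
SHIFTS_PER_PLANE = 12   # CM shifts needed to move one bit plane
RAM_OPS_PER_PLANE = 1   # GAPP RAM operation per bit plane


def transfer_cost(bit_planes):
    """Count CM shifts and GAPP RAM operations for a transfer."""
    shifts = bit_planes * SHIFTS_PER_PLANE
    ram_ops = bit_planes * RAM_OPS_PER_PLANE
    return {
        "cm_shifts": shifts,            # can overlap with program execution
        "ram_ops": ram_ops,             # the only cycles that interrupt the PEs
        "interrupting_fraction": ram_ops / (shifts + ram_ops),
    }


if __name__ == "__main__":
    print(transfer_cost(8))
```

For 8 bit planes this gives 96 shift instructions and 8 RAM operations, so only 8 of the 104 transfer cycles need to interrupt processing.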


Table 1 provides a program listing that writes to the ex- 
ternal memory. Table 2 provides a program listing that 
reads from the external memory (refer to Figures C and 
D). 


GAPP™ is a trademark of NCR Corporation.


■ TABLE 1. PROGRAM LISTING FOR WRITING DATA TO EXTERNAL RAM.


GAPP instructions supplied from the instruction queue of the GAPP controller


                                            Buffer    External
Bit     Instruction  GAPP                   Tristate  Memory
Plane   Number       Instruction    R/W     Control   Address   Comments

1st     1            m CM:=RAM      1       1         -         Begin WRITING to external RAM by loading
                                                                GAPP data into the CM plane.
        2            CM:=CMS        1       1
        3            CM:=CMS        0       1         n         There is a one cycle pipeline delay before
                                                                write to RAM begins.
        4            CM:=CMS        0       1         n+1
        5            CM:=CMS        0       1         n+2
        6            CM:=CMS        0       1         n+3
        7            CM:=CMS        0       1         n+4
        8            CM:=CMS        0       1         n+5
        9            CM:=CMS        0       1         n+6
        10           CM:=CMS        0       1         n+7
        11           CM:=CMS        0       1         n+8
        12           CM:=CMS        0       1         n+9
        13           CM:=CMS        0       1         n+10
2nd     14           m+1 CM:=RAM    0       1         n+11      Finish writing bit plane to RAM.
        15           CM:=CMS        1       1         -         Start next bit plane.
        16           CM:=CMS        0       1         n+12
        .
        .
        .
        26           CM:=CMS        0       1         n+22


Copyright © 1985 by NCR Corporation, Dayton, Ohio, USA. All rights reserved. Printed in USA.
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GAPP APPLICATION NOTE NO. 2 


■ TABLE 1. CONTINUED


                                       Buffer    External
Instruction  GAPP                      Tristate  Memory
Number       Instruction    R/W        Control   Address   Comments

27           m+2 CM:=RAM    0          1         n+23      Finish writing bit plane to RAM.
28           CM:=CMS        1          1                   Start next bit plane.
29           CM:=CMS        0          1         n+24
.
.
.
104          CM:=CMS        0          1         n+94
105          NOP            0          1         n+95      Finish writing 8th bit plane.


n represents the base address of the external RAM that is written to.
m represents the base address within GAPP RAM


■ TABLE 2. PROGRAM LISTING FOR READING DATA FROM EXTERNAL RAM.


GAPP instructions supplied from the instruction queue of the GAPP controller


                                            Buffer    External
Bit     Instruction  GAPP                   Tristate  Memory
Plane   Number       Instruction    R/W     Control   Address   Comments

1st     1            CM:=CMS        1       1         -         Begin READING from external RAM. There is a
                                                                one cycle pipeline delay before read from
                                                                RAM begins.
        2            CM:=CMS        1       0         n
        3            CM:=CMS        1       0         n+1
        4            CM:=CMS        1       0         n+2
        5            CM:=CMS        1       0         n+3
        6            CM:=CMS        1       0         n+4
        7            CM:=CMS        1       0         n+5
        8            CM:=CMS        1       0         n+6
        9            CM:=CMS        1       0         n+7
        10           CM:=CMS        1       0         n+8
        11           CM:=CMS        1       0         n+9
        12           CM:=CMS        1       0         n+10
        13           m RAM:=CMS     1       0         n+11      Finish reading bit plane from RAM and store
                                                                in GAPP RAM.
2nd     14           CM:=CMS        1       1         -         Start next bit plane.
        15           CM:=CMS        1       0         n+12
        16           CM:=CMS        1       0         n+13
        .
        .
        .
        26           m+1 RAM:=CM    1       0         n+23      Finish reading bit plane from RAM and store
                                                                in GAPP RAM.
        27           CM:=CMS        1       1                   Start next bit plane.
        28           CM:=CMS        1       0         n+24
        29           CM:=CMS        1       0         n+25
        .
        .
        .
        103          CM:=CMS        1       0         n+94
        104          m+7 RAM:=CM    1       0         n+95      Finish reading 8th bit plane.
        105          NOP            1       1         -


n represents the base address of the external RAM that is read from. 
m represents the base address within GAPP RAM 
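
The regular structure of the read listing in Table 2 (one buffer-disabled cycle, eleven shifts, then a store to GAPP RAM, repeated for eight bit planes and followed by a NOP) can be generated programmatically. The Python sketch below is an illustration only: the tuple layout and the function name are assumptions, addresses are given as offsets from n, and the store at the end of each plane is written as RAM:=CM as in rows 26 and 104.

```python
def read_sequence(planes=8, rows=12):
    """Build the Table 2 read listing as (instruction, tristate, address offset).

    An address offset of None means no external RAM location is accessed.
    """
    seq = []
    for k in range(planes):
        base = k * rows
        # Pipeline-delay cycle: buffers disabled, no address presented.
        seq.append(("CM:=CMS", 1, None))
        # Eleven shifts, each bringing in one row from external RAM.
        for j in range(rows - 1):
            seq.append(("CM:=CMS", 0, base + j))
        # Last address of the plane coincides with the store to GAPP RAM.
        seq.append((f"m+{k} RAM:=CM", 0, base + rows - 1))
    seq.append(("NOP", 1, None))
    return seq


if __name__ == "__main__":
    for number, row in enumerate(read_sequence(), start=1):
        if number <= 15 or number >= 103:
            print(number, row)
```

The generated listing has 105 instructions and touches external addresses n through n+95, in agreement with the table.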
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[Figure A diagram not reproduced. Labels: GAPP DEVICE with CMN and CMS ports; two 74LS241 TRISTATE BUFFERs; TRISTATE CONTROL, R/W and ADDRESS lines; "1 CYCLE DELAY IN EXECUTION PER PLANE"; "12 CYCLES INTERLEAVED PER PLANE".]


Figure A. Memory expansion of a GAPP based system using six n x 1 bit
external RAMs.
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[Figure B diagram not reproduced. Labels: GAPP DEVICE with CMN and CMS ports; two 74LS241 TRISTATE BUFFERs; TRISTATE CONTROL, R/W and ADDRESS lines; "1 CYCLE DELAY IN EXECUTION PER PLANE"; "12 CYCLES INTERLEAVED PER PLANE".]


Figure B. Memory expansion of GAPP based system using a single n x 8 bit
external RAM.
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GAPP APPLICATION NOTE NO. 2 


[Figure C diagram not reproduced. Column headings: CONTENTS OF GAPP CM REGISTERS; CONTENTS OF EXTERNAL RAM; EXTERNAL RAM ADDRESS. Panels: 1. m CM:=RAM; 2. CM:=CMS; 3. CM:=CMS; each with CMN feeding DATA IN of the external RAMs.]


Figure C. Diagram of data transfer from location m in GAPP RAM into external RAM via the CM registers. See program
listing in Table 1. Shown above are the results of the first 3 GAPP instructions referred to in Table 1. On the
left, each square represents the CM register for each processing element in the GAPP processor array. On the
right, each column represents an n x 1 bit external RAM. Each square within a column represents a distinct
RAM location. The column to the far right lists addresses for each location with the highlighted address
representing the external RAM location being accessed in the current instruction cycle.




[Figure D diagram not reproduced. Column headings: CONTENTS OF GAPP CM REGISTERS; CONTENTS OF EXTERNAL RAM (RAM 1 through RAM 6); EXTERNAL RAM ADDRESS. Panels: 1. CM:=CMS; 2. CM:=CMS; 3. CM:=CMS; each with DATA OUT of the external RAMs feeding CMS.]


Figure D. Diagram of data transfer from external RAM into CM registers of a GAPP device. After data is shifted into the
CM registers it is then transferred to GAPP RAM. See program listing in Table 2. Shown above are the results of
the first 3 GAPP instructions referred to in Table 2. On the left, each square represents the CM register for each
processing element in the GAPP processor array. On the right, each column represents an n x 1 bit external RAM.
Each square within a column represents a distinct RAM location. The column to the far right lists addresses for
each location with the highlighted address representing the external RAM location being accessed in the current
instruction cycle. After an entire bit plane is loaded into CM from external RAM it is then stored in GAPP RAM.
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NCR45CM16 





CMOS 16 X 16 BIT 
SINGLE PORT MULTIPLIER/ACCUMULATOR 


■ GENERAL DESCRIPTION


The NCR45CM16 is a 24 pin CMOS multiplier/accumulator for use with 16-bit microprocessor systems. All input and output
data are transferred through a single 16-bit bidirectional data bus in signed two's complement format. This device is
TTL/CMOS compatible and requires no clock due to its total static (asynchronous) operation. The device may be attached to the
system bus in the same way as a 16-bit wide static RAM. A single 16 x 16 multiply and read 32-bit result requires 5 cycles
(write X, write Y, multiply, read high-order result, read low-order result). Pipelined multiply/accumulate operations require
only 2 cycles each.


■ FEATURES
● 24 Pin Package
  - 300 mil Ceramic "Skinny DIP"
  - 600 mil Plastic DIP
● 40 bit Accumulator
  - Add Product to Accumulator
  - Subtract Product from Accumulator
● Cycle Time 190 ns (typ)
● Low Power CMOS
  - 100 µW Standby (max)
  - 10 mA Operating (max)
● Single 5 Volt ± 10% Supply
● Fully Static Operation - No Clock Required
● 3-state Bus Compatible Outputs












■ PIN CONFIGURATION (NCR45CM16-P Plastic DIP, NCR45CM16-C Ceramic DIP)

Pin  Name     Pin  Name
 1   CS       24   VDD
 2   A3       23   A2
 3   WE       22   A1
 4   D8       21   A0
 5   D9       20   D7
 6   D10      19   D6
 7   D11      18   D5
 8   D12      17   D4
 9   D13      16   D3
10   D14      15   D2
11   D15      14   D1
12   GND      13   D0


■ FUNCTIONAL BLOCK DIAGRAM

[Block diagram not reproduced. Labelled blocks: X-REGISTER, MULTIPLIER ARRAY, ACCUMULATOR, MULTIPLEXER; data bus D0-D15.]


■ PIN NAMES

D0-D15   Data Inputs/Outputs
CS       Chip Select
WE       Write Enable
VDD      5V ± 10% Supply Voltage


* Specifications are subject to change without notice.
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■ ABSOLUTE MAXIMUM RATINGS


Supply Voltage, VDD ........................... +7V
Voltage on any pin with respect to ground ..... -0.3 to VDD + 0.3V
Storage temperature ........................... -65°C to 150°C

Stresses above "absolute maximum ratings" may result
in damage to the device. Functional operation of devices
at the "absolute maximum ratings" or above the
recommended operation conditions stipulated elsewhere in this
specification is not implied.


■ RECOMMENDED OPERATING CONDITIONS


Parameter

Supply voltage
Input high level voltage
Input low level voltage
Operating ambient temperature





■ STATIC ELECTRICAL CHARACTERISTICS
OVER RECOMMENDED OPERATING CONDITIONS


Parameter                   Conditions

Input leakage current       VIN = 0V to VDD max
Output leakage current      VO = 0.4 to VDD max, CS = 1
Output high voltage         IOH = 400 µA
Output low voltage          IOL = 2.1 mA
Supply current - Active     Outputs open
Supply current - Standby





*Typical limits are Vpp = 5.0V, Ta = 25°C; typical parameters are not guaranteed 


■ CAPACITANCE  TA = 25°C, f = 1 MHz


Parameter                   Conditions

Input capacitance           All pins except pin under test are tied to ground
Input/Output capacitance    All pins except pin under test are tied to ground





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NCR45CM16 


READ OPERATIONS (WE = 1)


A3  A2  A1  A0   OPERATION

X   0   0   0    Read bits 0 through 15 of result from accumulator
X   0   0   1    Read bits 16 through 31 of result from accumulator
X   0   1   0    Read bits 32 through 47 of result* from accumulator


DIVIDE BY 2 AND READ (WE = 1)


OPERATION

Read bits 1 through 16 of result from accumulator
Read bits 17 through 32 of result from accumulator
Read bits 33 through 48 of result from accumulator*


X = Don't care


*NOTE: Accumulator accumulates to 40 bits. Thus bits 0 - 39 are valid, while bits 40 - 48 are a sign extension of bit 39.


WRITE OPERATIONS (WE = 0)


[Table not recoverable from this copy. Column headings: A3, A2, ACCUMULATOR OPERATION; A1, A0, MULTIPLIER OPERATION. Surviving entries: "Subtract X·Y from A"; "Write new data to X"; "Write new data to both X and Y".]


A = Accumulator
X = Data latched into X-register
Y = Data latched into Y-register










■ EXAMPLE OPERATIONS


1. Multiply two 16-bit numbers, read 32-bit result

Instruction  WE  Operation
0010         0   Clear A, Write X
0001         0   Clear A, Write Y
0100         0   Add X·Y to A
0000         1   Read low order result
0001         1   Read high order result


2. Multiply two 16-bit numbers and accumulate, repeat five times (five point digital filter), read 40-bit result
   = X1·Y1 + X2·Y2 + X3·Y3 + X4·Y4 + X5·Y5

Instruction  WE  Operation
0010         0   Clear A, Write X1
0001         0   Clear A, Write Y1
0110         0   A = X1·Y1, Write X2
1001         0   Write Y2
0110         0   A = X1·Y1 + X2·Y2, Write X3
1001         0   Write Y3
0110         0   A = X1·Y1 + X2·Y2 + X3·Y3, Write X4
1001         0   Write Y4
0110         0   A = X1·Y1 + X2·Y2 + X3·Y3 + X4·Y4, Write X5
1001         0   Write Y5
0100         0   A = X1·Y1 + X2·Y2 + X3·Y3 + X4·Y4 + X5·Y5
0010         1   Read most significant bits (32-47) of result
0001         1   Read bits 16-31 of result
0000         1   Read bits 0-15 of result


3. Half of sum of squares = ½ (B1² + B2²)

Instruction  WE  Operation
0011         0   Clear A, Write B1 to Registers X and Y
0111         0   A = B1², Write B2 to Registers X and Y
0100         0   A = B1² + B2²
0101         1   Divide A by 2 and read most significant 16 bits


4. Scale a series of numbers by a constant

Instruction  WE  Operation
0001         0   Clear A, Write Constant to Y
1010         0   NOP A, Write X1
0100         0   A = X1·Y
0001         1   Read high order result
0000         1   Read low order result
0010         0   Clear A, Write X2
0100         0   A = X2·Y
0001         1   Read high order result
0000         1   Read low order result
0010         0   Clear A, Write X3
0100         0   A = X3·Y
0001         1   Read high order result
0000         1   Read low order result
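
The cycle counts implied by Examples 1 and 2 can be written as a small formula: two writes for the first operand pair, two bus cycles for each further pair, one final accumulate, and one cycle per 16-bit word read back. The sketch below restates that count; the function name is illustrative, and the 190 ns figure is the typical cycle time from the FEATURES list, used here only as an assumed per-cycle time.

```python
TYPICAL_CYCLE_NS = 190  # "Cycle Time 190 ns (typ)" from the FEATURES list


def mac_bus_cycles(n_terms, read_words):
    """Bus cycles for an n-term multiply/accumulate followed by a read-back."""
    if n_terms < 1:
        raise ValueError("need at least one product term")
    first_pair = 2                    # write X1, write Y1
    later_pairs = 2 * (n_terms - 1)   # accumulate + write Xi, then write Yi
    final_accumulate = 1              # add the last product
    return first_pair + later_pairs + final_accumulate + read_words


if __name__ == "__main__":
    for terms, words in ((1, 2), (5, 3)):
        cycles = mac_bus_cycles(terms, words)
        print(terms, words, cycles, cycles * TYPICAL_CYCLE_NS, "ns")
```

A single multiply with a 32-bit read takes 5 cycles, and the five point filter with a 40-bit (three word) read takes 14 cycles, matching the number of rows in Examples 1 and 2.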
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NCR45CM16 


■ AC CHARACTERISTICS  VDD = 4.5 to 5.5V, TA = 0 to 70°C, VIL = 0.0V, VIH = 3.0V
READ CYCLE


Read Cycle Time
Address Access Time


Chip Select to Output Data Valid
Write Enable Set Up Before Select
Read Recovery Time

Chip Deselect to Output High-Z
Data Hold from Read Time


READ CYCLE TIMING WAVEFORMS
The read operation is performed with WE = high. The falling edge of CS latches the address and initiates the read process.


[Timing diagram not reproduced. Signals shown: ADDRESS, WE, CS, DATA OUT.]


WRITE CYCLE 


PARAMETER 


Write Cycle Time 
Address Valid to End of Write 
Write Enable Set Up Before Select 


Write Recovery Time 

Write Pulse Width 

Data Set Up to Write Time 
Data Hold From Write Time 
Write Enable to Output Hi-Z 





WRITE CYCLE TIMING WAVEFORMS
The write operation is performed with WE = low. The falling edge of CS latches the address and the rising edge of CS latches
the data in.


[Timing diagram not reproduced. Signals shown: ADDRESS, WE, CS, DATA IN.]





■ AC TEST LOAD CIRCUIT


[Circuit diagram not reproduced. Label: OUTPUT UNDER TEST.]


*Includes jig capacitance.
All diodes 1N3064 or equivalent.





CAUTION 


1. CMOS Devices are damaged by high energy electrostatic discharge. Devices must be stored in conductive foam or 
with all pins shunted. 


2. Remove power before insertion or removal of this device. 


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ERRATA 


NCR45CM16 SINGLE PORT MULTIPLIER DATA SHEET 


The "WRITE OPERATIONS" table on page 3 of the data sheet implies 
that data can be written to the Y register while simultaneously 
adding or subtracting the previous XY product into the accumula- 
tor. This CANNOT be done in all cases. The following table 
replaces the WRITE OPERATIONS table at the bottom of page 3. Note 
that op-codes 0101 and 1101 are invalid. 


WRITE OPERATIONS (WE=0)


A3 A2 A1 A0  Operation                    A3 A2 A1 A0  Operation


0  0  0  0   Clear Accumulator            1  0  0  0   Retain Accumulator
             Retain X and Y                            Retain X and Y (NOP)

0  0  0  1   Clear Accumulator            1  0  0  1   Retain Accumulator
             Write new data to Y                       Write new data to Y

0  0  1  0   Clear Accumulator            1  0  1  0   Retain Accumulator
             Write new data to X                       Write new data to X

0  0  1  1   Clear Accumulator            1  0  1  1   Retain Accumulator
             Write new data to X and Y                 Write new data to X and Y

0  1  0  0   Add X·Y to Accumulator       1  1  0  0   Subtract X·Y from Accum.
             Retain X and Y                            Retain X and Y

0  1  0  1   Invalid Operation            1  1  0  1   Invalid Operation

0  1  1  0   Add X·Y to Accumulator       1  1  1  0   Subtract X·Y from Accum.
             Write new data to X                       Write new data to X

0  1  1  1   Add X·Y to Accumulator       1  1  1  1   Subtract X·Y from Accum.
             Write new data to X and Y                 Write new data to X and Y
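

The op-code table above can be exercised with a small behavioural model. The Python sketch below is an illustration under stated assumptions, not a description of the device internals: the class and method names are hypothetical, the accumulate step is assumed to use the X and Y values latched before the current write (as Example 2 of the data sheet implies), and reads are assumed to select a 16-bit word with A1 A0 and to shift right by one bit when A2 = 1.

```python
def _signed(value, bits):
    """Wrap an integer into a signed two's complement field of the given width."""
    half = 1 << (bits - 1)
    return ((value + half) % (1 << bits)) - half


class CM16Model:
    """Behavioural sketch of the write/read op-codes (hypothetical helper)."""

    INVALID_WRITES = {0b0101, 0b1101}  # declared invalid by the errata

    def __init__(self):
        self.x = self.y = self.acc = 0

    def write(self, opcode, data=0):
        """Perform a write cycle (WE = 0) with address bits A3..A0 = opcode."""
        if opcode in self.INVALID_WRITES:
            raise ValueError("op-code is invalid per the errata")
        a3, a2 = (opcode >> 3) & 1, (opcode >> 2) & 1
        a1, a0 = (opcode >> 1) & 1, opcode & 1
        product = self.x * self.y          # uses the previously latched X and Y
        if a2:                             # add (A3=0) or subtract (A3=1)
            delta = -product if a3 else product
            self.acc = _signed(self.acc + delta, 40)
        elif not a3:                       # A3=0, A2=0: clear accumulator
            self.acc = 0
        if a1:
            self.x = _signed(data, 16)     # latch new X
        if a0:
            self.y = _signed(data, 16)     # latch new Y

    def read(self, opcode):
        """Perform a read cycle (WE = 1); returns a 16-bit word."""
        shift = 16 * (opcode & 0b11) + ((opcode >> 2) & 1)
        return (self.acc >> shift) & 0xFFFF


if __name__ == "__main__":
    m = CM16Model()
    xs, ys = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]
    m.write(0b0010, xs[0])
    m.write(0b0001, ys[0])
    for x, y in zip(xs[1:], ys[1:]):
        m.write(0b0110, x)   # accumulate previous product, latch next X
        m.write(0b1001, y)   # retain accumulator, latch next Y
    m.write(0b0100)          # accumulate the last product
    print(m.read(0b0000), m.read(0b0001), m.read(0b0010))
```

The usage at the bottom follows the five point filter sequence of Example 2; Python's arithmetic right shift on negative integers supplies the sign extension described in the accumulator note.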








APPLICATION NOTE M-1 
NCR45CM16 


MICROPROCESSOR MULTIPLICATION ACCELERATOR 


As many assembly language programmers can attest,
performing multiplication operations with a microprocessor
can take a great amount of time. The unaided
microprocessor is especially slowed down by repeated
multiply-accumulate operations that are common in process
control or digital signal processing applications. This reduced
performance limits the maximum bandwidth signal that
the general purpose microprocessor can handle.


The alternatives, however, for improving the effective 
throughput of the processor are expensive. Previously 
the system designer could add a special purpose array 
processor board to his system, or redesign his system to 
use a special purpose DSP microprocessor. Both of
these options require high expense or extensive engi- 
neering which may not be justified for many applica- 
tions. 


Another solution, that of adding an expensive three port 
multiplier chip with the associated latches and logic
required to interface it to the system, can take up a large
amount of system board space and consume an inordi- 
nate amount of power. 


On the other hand, a small, low power multiplier chip
that could be interfaced to the system with little
additional circuitry would be an attractive solution to the
throughput problem. NCR has developed a microprocessor
bus compatible, 16 x 16 multiplier (NCR45CM16)
which is designed specifically as a microprocessor
"multiplication accelerator". It is packaged in a small 24-pin
DIP and typically consumes only 5 mA while cycling
through multiply-accumulate operations at a 5 MHz pace.
One important feature of the device is its simple system
interface. The NCR45CM16 attaches to the microprocessor
bus and appears to the system as a 200-ns, 16-bit
wide static RAM. Figure 1 shows the size of the
NCR45CM16 package next to a conventional three-port
multiplier. The small size of the single port device will
allow its incorporation into many existing microcomputer
boards.





Figure 1. Comparison between the NCR45CM16 (below) 
and conventional three-port multiplier/accu- 
mulator chip (above) clearly shows that the 
bulky three-port does not size up for space
limited microprocessor board designs.
USE IN A SYSTEM


The multiplier chip is most easily used if it is mapped
directly into the processor's memory space. This is
because the device has chip enable (CE) and write enable
(WE) pins that perform the same functions as they
would for a static RAM. When the device is not enabled,
the I/O pins go into a high-impedance state that
effectively disconnects the multiplier from the system bus.
As shown in Figure 2, the chip has input registers X-REG
and Y-REG that are written to through the single port
bus interface. The product of these registers is then
available for an accumulate operation on the next cycle.
This product may be added to or subtracted from the
40-bit accumulator while the X register is simultaneously
updated. The result in the 40-bit accumulator may be
read 16 bits at a time: least significant 16, most
significant 16, or high significant 16. The latter is produced
only with repeated multiply-accumulates that create a
result greater than 32 bits. Figure 3 provides details of
the multiplier operation. Control of the input registers,
output registers and accumulator operation is
determined by bits of the address bus (A0-A3).
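The register behavior described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not NCR's implementation: the class name `Ncr45cm16Model` and the helper `sum_products` are hypothetical, operands are assumed to be signed 16-bit two's-complement values, and the read addresses 0, 1 and 2 follow the offsets used in this note's 68000 example. The divide-by-2 read modes are not modeled.

```python
class Ncr45cm16Model:
    """Behavioral sketch of the single-port multiplier/accumulator."""

    ACC_BITS = 40  # the accumulator accumulates to 40 bits

    def __init__(self):
        self.x = 0
        self.y = 0
        self.acc = 0

    @staticmethod
    def _signed16(word):
        # Assumption: bus data is a signed 16-bit two's-complement word.
        word &= 0xFFFF
        return word - 0x10000 if word & 0x8000 else word

    def _wrap(self, value):
        # Keep the accumulator inside the signed 40-bit range.
        value &= (1 << self.ACC_BITS) - 1
        if value >> (self.ACC_BITS - 1):
            value -= 1 << self.ACC_BITS
        return value

    def write(self, addr, data):
        """Write cycle (WE = 0); addr carries the op-code A3..A0."""
        a3, a2 = (addr >> 3) & 1, (addr >> 2) & 1
        a1, a0 = (addr >> 1) & 1, addr & 1
        if a2 and a0 and not a1:
            raise ValueError("op-codes 0101 and 1101 are invalid")
        product = self.x * self.y  # product of the values latched earlier
        if a2:    # A2 = 1: accumulate (A3 = 0 adds, A3 = 1 subtracts)
            delta = -product if a3 else product
            self.acc = self._wrap(self.acc + delta)
        elif not a3:  # A3 = A2 = 0: clear; A3 = 1, A2 = 0: retain
            self.acc = 0
        if a1:
            self.x = self._signed16(data)
        if a0:
            self.y = self._signed16(data)

    def read(self, addr):
        """Read cycle (WE = 1): 0 = bits 0-15, 1 = 16-31, 2 = 32-47."""
        if addr not in (0, 1, 2):
            raise ValueError("only the three plain reads are modeled")
        return (self.acc >> (16 * addr)) & 0xFFFF


def sum_products(xs, ys):
    """Mirror the write/read sequence of the 68000 example."""
    mac = Ncr45cm16Model()
    mac.write(0b0011, 0)        # write 0 to X and Y, clear accumulator
    for x, y in zip(xs, ys):
        mac.write(0b0110, x)    # add previous X * Y, latch next X
        mac.write(0b1001, y)    # retain accumulator, latch next Y
    mac.write(0b0110, 0)        # last multiply/accumulate
    low, mid, high = (mac.read(i) for i in range(3))
    raw = low | (mid << 16) | ((high & 0xFF) << 32)
    return raw - (1 << 40) if raw >> 39 else raw
```

The accumulate step deliberately uses the product of the previously latched registers before X is updated, which is what lets one write cycle both accumulate and load the next operand.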


For a series of multiply-accumulate operations (such as
an FIR filter computation), the device can operate as
a two cycle pipeline (write to X-REG and accumulate,
write to Y-REG). After the fast arithmetic operation,
three read operations would be required to obtain the
full precision output. A multiplier-aided 68000 or 8086
will be approximately three times faster than an unaided
68000 or 8086 microprocessor using only the internal
multiply instruction.


MICROPROCESSOR INTERFACE


The NCR45CM16 is easily interfaced to both the 68000 
and 8086 microprocessors. Typical interface circuitry 
for both micros can be seen in Figures 4a-4c. Examples


Figure 2. Functional Block Diagram (blocks shown: input/output port, multiplier array, control)


of 68000 and 8086 assembly code used with the multiplier
are included at the end of this application note.


NCR reserves the right to make any changes or discontinue altogether without notice with respect to any hardware or software product or the 







NCR45CM16


READ OPERATIONS (WE = 1) 


OPERATION 





*NOTE: Accumulator accumulates to 40 bits. Thus bits 0 - 39 are valid while bits 40 - 47 are a sign extension of bit 39.


DIVIDE BY 2 AND READ (WE = 1) 


OPERATION 





X = Don't care 


*NOTE: Accumulator accumulates to 40 bits. Thus bits 1 - 39 are valid, while bits 40 - 48 are a sign extension of bit 39. 


WRITE OPERATIONS (WE = 0)

A3 A2 A1 A0  OPERATION                    A3 A2 A1 A0  OPERATION

0  0  0  0   Clear Accumulator            1  0  0  0   Retain Accumulator
             Retain X and Y                            Retain X and Y (NOP)

0  0  0  1   Clear Accumulator            1  0  0  1   Retain Accumulator
             Write new data to Y                       Write new data to Y

0  0  1  0   Clear Accumulator            1  0  1  0   Retain Accumulator
             Write new data to X                       Write new data to X

0  0  1  1   Clear Accumulator            1  0  1  1   Retain Accumulator
             Write new data to X and Y                 Write new data to X and Y

0  1  0  0   Add X * Y to Accumulator     1  1  0  0   Subtract X * Y from Accum.
             Retain X and Y                            Retain X and Y

0  1  0  1   Invalid Operation            1  1  0  1   Invalid Operation

0  1  1  0   Add X * Y to Accumulator     1  1  1  0   Subtract X * Y from Accum.
             Write new data to X                       Write new data to X

0  1  1  1   Add X * Y to Accumulator     1  1  1  1   Subtract X * Y from Accum.
             Write new data to X and Y                 Write new data to X and Y





Data latched into X-register 
Data latched into Y-register 





Figure 3. Read and write operations of the NCR45CM16 
are determined by 4 address pins (A0-A3) and the
write enable (WE) pin. 
MC68000 to 45CM16 INTERFACE

Figure 4a. (Schematic; signals shown: 68000 D0-D15 to 45CM16 D0-D15, R/W, A1-A4, address decoder.)
EXAMPLE:


Using the NCR45CM16 for Assembly Language Multiplication/Addition — 68000 Application 


The NCR45CM16 can speed up compute-bound problems
on 16 bit microprocessors. One application that
benefits from adding a 45CM16 is the computation of
the sum of products:

    A = (X1) * (Y1) + (X2) * (Y2) + ... + (Xn) * (Yn).

Code for implementing this algorithm on the MC68000
is given below:

* SUBROUTINE: Sum Products
*   A0 points to the first element in the X list
*   A1 points to the first element in the Y list
*   D0 contains the number of products to be summed
*   Result of product sum returned in low byte of D3 plus D2
* Define memory mapping for relevant 45CM16 instructions:
NCRAREA  EQU     XXXX                  BASE ADDRESS FOR 45CM16 I/O
* Offsets for writes
WXYCLRA  EQU     $3                    WRITE TO BOTH X, Y; CLEAR A
ADDXYWX  EQU     $6                    ADD X * Y TO A; PUT NEW DATA IN X
WRITE_Y  EQU     $9                    WRITE NEW DATA TO Y
* Offsets for reads
A_LOW    EQU     $0                    LOW WORD OF 40 BIT ACCUM.
A_MID    EQU     $1                    BITS 16-31 OF A
A_HIGH   EQU     $2                    BITS 32-47 (40-47 EXTENDED)
* Executable code:
START    MOVE.W  #NCRAREA,A2
         CLR.W   WXYCLRA*2(A2)         SEND 0's TO X, Y, AND A
LOOP     MOVE.W  (A0)+,ADDXYWX*2(A2)   MULTIPLY/ACCUMULATE, NEXT X
         MOVE.W  (A1)+,WRITE_Y*2(A2)   WRITE NEXT Y
         DBF     D0,LOOP
         MOVE.W  D0,ADDXYWX*2(A2)      LAST MULTIPLY/ACCUMULATE
         MOVE.W  A_LOW*2(A2),D1        FETCH LOW WORD IN ACCUMULATOR
         MOVE.W  A_MID*2(A2),D2        FETCH BITS 16-31 IN A
         SWAP    D2                    MOVE MIDDLE A WORD TO HIGH D2
         MOVE.W  D1,D2                 CONVERT TO SINGLE 32-BIT
         MOVE.W  A_HIGH*2(A2),D3       FETCH HIGH ACCUMULATOR WORD
         RTS

Of course, the subroutine can be made to execute even
faster by using separate address registers to hold write
and read locations instead of using offsets. But even in
the above, register-conserving approach it is clear that
using the 45CM16 to do the multiply-and-accumulate
loop greatly reduces the overhead and shortens the code
of the corresponding loop for an unaided 68000. Without
the 45CM16 a programmer would have to use the
68000's own signed multiply instruction and a 32 bit
addition even to accumulate to just 32 bits. This requires
82 machine cycles of execution time for the unaided
68000 versus either 24 or 32, depending on the
addressing mode, for the same operations done through
the 45CM16 in a loop.

The largest disadvantage to the unaided approach, however,
is the overhead required to do accumulation. With
the 45CM16 at least 256 products can safely be added
and the high byte of the 40 bit accumulator can be
fetched at the end by the 68000. Without the 45CM16,
adding 32 bit quantities in succession requires overflow
checking and updating the bits in a second data register
on each addition. This results in still further delay in the
loop.
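The accumulator headroom claim can be checked with a few lines of arithmetic. The sketch below assumes signed 16-bit two's-complement operands, so the largest possible product is (-32768) * (-32768) = 2^30; the constant names are illustrative only.

```python
# Largest magnitude a single 16 x 16 signed product can reach.
WORST_PRODUCT = (-32768) * (-32768)          # 2**30

# Largest positive values of signed 40-bit and 32-bit accumulators.
ACC40_MAX = 2 ** 39 - 1
ACC32_MAX = 2 ** 31 - 1

# Worst-case products that can be summed with no possibility of overflow.
SAFE_TERMS_40 = ACC40_MAX // WORST_PRODUCT
SAFE_TERMS_32 = ACC32_MAX // WORST_PRODUCT

print(SAFE_TERMS_40, SAFE_TERMS_32)
```

Under these assumptions the 40-bit accumulator tolerates several hundred worst-case products, so "at least 256" is a conservative figure, while a bare 32-bit sum is only guaranteed for a single worst-case product. That is why the unaided loop needs overflow checking on each addition.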


In short, there is more code and more delay in the un- 
aided multiply-and-accumulate loop than in a similar 
loop executed through the 45CM16. With the multipli- 
cation accelerator, much code becomes unnecessary and 
the only additional code required for communicating 
with the 45CM16 is that which fetches the result from 
the accumulator at the end. 
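As a quick check, the cycle counts quoted earlier (82 machine cycles unaided versus 24 or 32 through the 45CM16) can be turned into speedup ratios. The variable names are illustrative; the numbers are the ones stated in this note.

```python
UNAIDED_CYCLES = 82          # 68000 signed multiply plus 32-bit add
AIDED_CYCLES = (24, 32)      # 45CM16 loop, depending on addressing mode

# Speedup = unaided time / aided time for one multiply-accumulate step.
speedups = [UNAIDED_CYCLES / cycles for cycles in AIDED_CYCLES]

for cycles, ratio in zip(AIDED_CYCLES, speedups):
    print(f"{cycles} cycles: about {ratio:.1f}x faster")
```

Both ratios bracket the "approximately three times faster" figure given for the multiplier-aided 68000.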
Using The NCR45CM16 For Assembly Language Multiplication/Addition on the 8086


The NCR45CM16 MAC can be interfaced with the Intel
8086 bus in two ways, memory mapped mode or I/O
mode. In the memory mapped mode, the 45CM16 acts
as a 16-bit wide RAM device connected to the 8086 bus.
Data transfer between the 8086 and the 45CM16 is
achieved by using one of the 23 different addressing
modes available with the MOV instruction. In the I/O
mapped mode, the 45CM16 acts as a 16 bit wide I/O
device or peripheral connected to the 8086 bus. Data
transfer between the 8086 and the 45CM16 is achieved
by using only two I/O instructions, IN and OUT.


The advantage of connecting the 45CM16 in the memory
mapped mode is the availability of a large number of
addressing modes for data transfer. However, all of these
modes, except one, have higher execution times than the
simple IN and OUT instructions. For the multiply/
accumulate operations on the 8086, the I/O interfacing
mode is used in this example. Comparing the unaided
8086 to the 45CM16 used in the I/O mode indicates a
speedup of approximately 3X for the multiply operation.


The following assembly language subroutine is used to 
calculate SUM as 

SUM = X(1) * Y(1) + X(2) * Y(2) + X(3) * Y(3) + ... + X(N) * Y(N)

The 45CM16 and data arrays are mapped into the memory
space 00 hex to FF hex. An Intel Intellec IV development
system was used to write the subroutine.
BEHIND THE COVER





Creating one of the first processor chips to
depart completely from the traditional
von Neumann architecture is a heady undertaking.
For a company involved in commercial microelectronics
for only three years, it might seem
brash, but that didn't faze NCR.

"I'm not sure that the management believed in
the systolic array processor all that much," recalls
Paul Sullivan, "but they let us go ahead." Indeed, he
admits to having no more than a passing interest in
systolic arrays when the engineers at Martin
Marietta brought up the concept in a 1982 meeting.

But Sullivan, the head of advanced development
at NCR's Microelectronics Division, was soon
convinced of the array's importance, and his enthusiasm
proved contagious. The result, an architecture
that links a number of processors handling data
quickly in parallel, is the topic of this issue's cover
story (p. 207).

Less than a year after the project got under way,
first silicon rolled off the line. Although the initial
array of 3 by 6 processors came in at a staggering
200,000 mils² (the fabrication personnel thought it
ridiculous to continue), it proved that the design
was sound. Further, it furnished a breadboard to
help assemble the final instruction set.

Cutting the chip down to size (by half) while
quadrupling the number of processors was no easy
task. In fact, the 6-by-12-processor array was the
first NCR part to incorporate a second layer of
metallization, as well as the densest CMOS device
the company ever ran through fabrication. "I'm still
not sure that tackling both these firsts simultaneously
was a good idea," confesses Sullivan.

Finding suitable CAD tools for such a complex
chip was a problem in itself, particularly because of
the difficulty in checking design rules at the densities
involved. Also, the amount of information that
made up the design data base was so massive that
most machines employed in fabricating masks
simply couldn't accept it.

In a project this ambitious, teamwork was essential.
At the start, though, senior circuit designer
Dave Thomas was about the only individual assigned
full-time to the processor. He did stay in close
contact with Martin Marietta's Wlodzimierz Holsztynski,
the mathematician responsible for the architecture
of the processor elements.
Systolic array chip
matches the pace of 
high-speed processing 





A monolithic systolic array packs 72 single-bit 
parallel processors, letting it clip along at 
the rates demanded to process images in real time. 





This is the first in a series dealing with systolic 
arrays. Subsequent articles will investigate 
such applications as pattern recognition, image 
manipulation, and data-base management. 


The herculean demands of real-time image
processing, automated inspection, and
artificial intelligence clearly reveal the
limitations of the traditional von Neumann
machine. Since that architecture handles only
one piece of data at a time, it severely constrains the
speed with which information can be processed. 
Further, because incoming data must be held 
until it can be put to use, the approach inher- 
ently calls for a large amount of memory. 

Some problems are alleviated by turning to 
parallel architectures, in which micropro- 
cessors or semicustom devices are linked. The 
sheer number of com- 
ponents involved, though, 
makes the size of such 
systems an obstacle in its 
own right. In short, one set 
of difficulties is traded off 
for another. What’s more, 
parallel processing does lit- 
tle to cut back on storage 
space. 

A solution is finally at 
hand, in the form of the 
first commercial systolic 
array processor chip. The 
Geometric Arithmetic 
Parallel Processor (GAPP) 
overcomes the intrinsic 
problem associated with 
the von Neumann computer by loading 72 bit-serial
parallel processor cells onto a single IC (see
"Systolic Arrays: The Heart of the Matter"). Each of
the processor elements contains an ALU and
128 bits of RAM, as well as bidirectional com- 
munication lines that connect the cell to its 
neighbors on the north, south, east, and west. In 
addition, a separate I/O communication bus al- 
lows data to be input from the south end of the 
array and output to the north without inter- 
fering with computation within the ALU. 

Central to the IC’s makeup is its single- 
instruction, multiple-data architecture, which 
distributes processing power among each of the 
identical bit-serial processor elements. Further 
boosting speed in a large number of applica- 
tions is the ability to cascade a number of ICs to 
form large arrays. 

Interestingly, the processor elements themselves
are not particularly fast, taking 2.5 µs to
add two 8-bit numbers. Executing 72 such operations
simultaneously, though, yields an overall
data rate of 28 million additions per second
for each device.

Assembling an array of chips also eliminates 
the bandwidth limitations that plague von Neu- 
mann machines. For instance, a 48-by-48-cell 
systolic processor —comprising 32 chips—can 


grab a 48-bit-wide word every 100 ns when oper- 
ating with a 10-MHz clock. The array’s band- 
width thus equals 480 Mbits/s. 
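The throughput and bandwidth figures above follow from simple arithmetic, sketched here with integer nanoseconds to avoid rounding noise. The constant names are illustrative; the inputs are the figures given in the article.

```python
CELLS_PER_CHIP = 72
ADD_TIME_NS = 2500             # 2.5 microseconds per 8-bit addition
NS_PER_SECOND = 1_000_000_000

# Every cell finishes one addition per ADD_TIME_NS, all in parallel.
ADDS_PER_SECOND = CELLS_PER_CHIP * NS_PER_SECOND // ADD_TIME_NS

# A 48-by-48-cell array built from 72-cell chips.
ARRAY_SIDE = 48
CHIPS_NEEDED = ARRAY_SIDE * ARRAY_SIDE // CELLS_PER_CHIP

# One 48-bit-wide word is accepted per 100-ns cycle of the 10-MHz clock.
CLOCK_HZ = 10_000_000
BANDWIDTH_BITS_PER_SECOND = ARRAY_SIDE * CLOCK_HZ

print(ADDS_PER_SECOND, CHIPS_NEEDED, BANDWIDTH_BITS_PER_SECOND)
```

The computed addition rate is slightly above the "28 million" quoted, which appears to be a rounded-down figure; the chip count and the 480 Mbits/s bandwidth match exactly.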

In pattern recognition and automated in- 
spection, the chip’s ability to handle entire 
images concurrently, instead of one pixel at a 
time, as is traditionally done, greatly speeds 
throughput. Since images are taken care of by a 
single chip, interactions with the host are elim- 
inated, as are the inherent restrictions of 
memory transfers using the system bus. And 
because a single cell can be mapped for each pix- 
el in an image, adding more chips pushes speed 
even higher. 


A juggling act 


The systolic array also is particularly suited 
to digital signal processing. Unlike single- 
instruction, single-data computers which can- 
not simultaneously calculate a host of basic 
operations (like multiplication, convolution, 
and trigonometric functions), the chip can per- 
form several of these common signal-process- 
ing operations concurrently. 

Additionally, to minimize the number of data 


Systolic arrays: The heart of the matter 


A systolic array is a regular arrangement of sim- 
ple, identical processor elements that are connected 
to their nearest neighbors. The term “systole” was 
originally used to refer to the recurrent contractions 
of the heart. As with the human circulatory system, 
systolic computations are characterized by the 
pumping of data through an array of processor ele- 
ments. While data moves in and out of the processor 
element, some operation is performed on it during 
each cycle. This maintains a regular flow, or circu- 
lation, of data within the network. Although defini- 
tions vary, systolic processors must first of all run in 
sync with a global system clock, so that data is rhyth- 
mically computed and passed through the network. 

An array can be extended arbitrarily by con- 
necting two or more processor elements to increase 
speed linearly with the number of elements. A good 


measure of the efficiency of an array processor vis-a- 
vis a single processor is the so-called speed-up factor, 
which is defined as the processing time for a single 
processor divided by that of an array. 

The systolic architecture is a natural one; it is a 
subset of the cellular automaton—a uniform array of 
many identical cells in which each cell interacts only 
with its neighbors. Interestingly enough, it was the 
father of conventional computer architecture, von 
Neumann, who performed some of the earliest in- 
vestigations into the cellular automaton structure as 
a potential machine configuration. Harbingers of 
today's systolic arrays are the Illiac IV system, developed
in the late 1960s, and the massively parallel
processor built by Goodyear Aerospace in 1981. 
Systolic chip architectures were developed at 
Carnegie-Mellon. 
fetches normally required, reads from or writes
to RAM can be performed on every cycle, at
the same time as computations. That ability is 
significant in real-time processing and many 
number-crunching applications, which are by 
nature memory-intensive. It also lends itself to 
artificial intelligence, in which volumes of data 
requiring extensive parallel processing are the 
rule rather than the exception. 

The chip operates at 5.0 ±0.5 V and dissipates
500 mW at a 10-MHz clock rate. Data setup and
hold times are 10 and 5 ns, respectively. Although
72 processors are now riding on the
100,000-mil² chip, future versions will go far
beyond that. The 6-by-12 array is now fabricated
with a 3-µm double-layer metal CMOS process,
but shrinking line widths to 1 µm will crowd 512
cells onto an equal-sized chip. Clearly, such density
makes CMOS the technology of necessity.

The array is housed either in a ceramic
84-lead pin-grid array or a plastic chip carrier 
(with the same number of contacts). As the 
number of processing elements is increased to 
512, only 162 pins will be needed, not quite dou- 
ble the number now employed. 


A peek inside 


Each processor element contains separate
lines that link the cell to its neighbors and to the
outside world. In addition to the North South
(N/S) and East West (E/W) lines that pass
data between cells, there are the CM South input
(CMS) and CM North output (CMN) (Fig. 1).
There is also a complement of 22 external signal
lines: 7 address lines (A0 through A6), 13 control
lines (C0 through C12), one global output (GO),
and one clock (CLK).

The chip's overall simplicity is reflected in
the layout of a single processor element (Fig. 2).
Each of its four latches (CM, N/S, E/W, and C,
the last referred to as the C register) accepts data
from up to eight possible sources, depending
upon the setting of the control lines. C0 and C1
control the input to the CM latch; C2 through C4
govern the input to the N/S latch; C5 through C7
manage the input to the E/W latch; and C8
through C10, the input to the C register. Lines
C11 and C12 handle reads from and writes to the
128-bit RAM.

Working from a truth table, the array performs
additions and subtractions (Table 1). The
C, NS, and EW inputs to the multiplexers represent
the contents of the C, N/S, and E/W registers,
respectively. The summing output of the
single-bit ALU, SM, goes directly to the RAM
and may also be simultaneously input to any of
the four registers. The Carry and Borrow outputs
(CY and BW, respectively) are open to the
C register. A truth table is used as well to fulfill
single- and dual-input logic functions (logical
complement, exclusive-OR, exclusive-NOR,
logical AND, and logical OR) on data in the
N/S and E/W latches (Table 2).
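A bit-serial add or subtract built from a 1-bit full adder/subtracter can be sketched as follows. This is an illustration of the general technique rather than a transcription of Table 1: the function names are hypothetical, and treating the two operand bits as the N/S and E/W inputs with the carry or borrow held in C is an assumption about how the cell is typically used.

```python
def full_add(ns, ew, c):
    """1-bit full adder: returns (sum bit SM, carry out CY)."""
    sm = ns ^ ew ^ c
    cy = (ns & ew) | (c & (ns ^ ew))
    return sm, cy


def full_sub(ns, ew, c):
    """1-bit full subtracter for ns - ew - c: returns (SM, borrow out BW)."""
    sm = ns ^ ew ^ c
    bw = ((1 - ns) & ew) | (c & (1 - (ns ^ ew)))
    return sm, bw


def bit_serial(op, x, y, width=8):
    """Process operands one bit per step, least significant bit first."""
    c = 0          # carry/borrow register, cleared before the first bit
    result = 0
    for i in range(width):
        sm, c = op((x >> i) & 1, (y >> i) & 1, c)
        result |= sm << i      # each sum bit would be written to RAM
    return result, c           # final carry or borrow remains in C


print(bit_serial(full_add, 200, 100))   # low 8 bits of 300, plus carry
print(bit_serial(full_sub, 5, 9))       # 5 - 9 modulo 256, plus borrow
```

Because one result bit is produced per step, an n-bit operation takes on the order of n steps per cell, which is consistent with a single cell being slow while the array as a whole is fast.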


What are your instructions? 


The chip is programmed with a sequence of 
instructions that, when compiled by an assem- 
bler, directs the appropriate control signals to 
every cell in the array. Up to five commands, 
one from each of the five groups that make up 
the overall set, can be executed simultaneously 
on every instruction cycle. The possible combi- 
nations of horizontally microcoded instruc- 
tions results in nearly 6000 commands (see 
Table 3). 

Images are manipulated by the array pro- 
cessor at a brisk pace. A 3-by-3-pixel mask with 


[Figure 1 block labels: 1-bit ALU; 4 registers; RAM (128 bits)]

1. Individual processor elements link to their next-door
neighbors on the north, south, east, and west
over bidirectional communication lines N/S and E/W,
respectively. The control lines C0 through C12 and
address lines A0 through A6 establish the signal paths to
the outside world. Each cell contains a 1-bit ALU, four
registers, and a 128-bit RAM.
an 8-bit gray scale can be convolved with an
8-bit gray-scale image in less than 300 µs, using
what is termed a global broadcast operation. In
that mode, a single bit (1 or 0) is transmitted to
the C register of every processor in the array
(by toggling control line C8). Each processor
will perform the same function on the broadcast
data, thereby increasing throughput. Since
mask data can be broadcast globally, every pixel
in the mask can operate simultaneously on
the entire image. Thus, the 3-by-3 convolution
can be accomplished with nine sets of global
operations that include multiply, shift, and add.
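The nine sets of global operations can be sketched at the whole-image level. In this illustration each pass broadcasts one mask coefficient to every cell, shifts the image plane so the right neighbor lines up, then multiplies and accumulates. The function names are hypothetical, pixels shifted in from outside the array are assumed to be zero, and the mask is applied without flipping (correlation form), which is an assumption since the article does not say.

```python
def shift(image, dr, dc):
    """Return a plane where out[r][c] = image[r + dr][c + dc], else 0."""
    rows, cols = len(image), len(image[0])
    return [
        [
            image[r + dr][c + dc]
            if 0 <= r + dr < rows and 0 <= c + dc < cols
            else 0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def convolve3x3(image, mask):
    """Nine broadcast passes: shift, multiply by one coefficient, add."""
    rows, cols = len(image), len(image[0])
    acc = [[0] * cols for _ in range(rows)]
    for mr in range(3):
        for mc in range(3):
            coeff = mask[mr][mc]                  # broadcast to all cells
            moved = shift(image, mr - 1, mc - 1)  # global shift
            for r in range(rows):
                for c in range(cols):
                    acc[r][c] += coeff * moved[r][c]  # multiply and add
    return acc


ones = [[1] * 3 for _ in range(3)]
print(convolve3x3(ones, ones))
```

The two inner loops over `r` and `c` stand in for what the array does in parallel: every cell performs the same multiply-accumulate at the same time, so the cost depends on the mask size and word length rather than on the image size.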


Similarly, a 9-by-9-pixel mask with an 8-bit
gray scale can be convolved with an 8-bit
gray-scale image in less than 5 ms. The 8 bits of 
gray scale furnish up to 256 shading intensities 
(from black to white)— better than the video 
signals coming from most TV cameras. 


To the north! 


As mentioned, the processor’s speed can be 
enhanced by linking several chips to create 
larger arrays. Moreover, doing so does not re- 
quire any changes in software. In addition, the 
systolic array has a Data Communication line 


Multiplexers 


Full 
adder, 
subtracter 





2. The layout of a single processor cell mirrors the simplicity of the 
overall array design. Each of the four registers, CM, N/S, E/W, and C, 
accepts data from up to eight sources. The settings of the control lines
determine which information is sent to each register. 


cessing, the parts that make up an image plane 
can be shifted to the center of the convolution 
window before the required multiplication is 
performed. That increases efficiency because 
two 8-bit integers form a 16-bit product. If the 
shift were not carried out first, the product 
would have to be shifted bit-serially to the center
of the window.

At the end of the convolution, the chip will
have performed a total of nine global 8-bit
multiplications, twelve shifts, and nine 16-bit
additions. The approximate execution time for each
multiplication is 25.2 µs; for a shift, 2.4 µs; and
for addition, 4.9 µs. Thus a total of just 299.7 µs
is expended on the operation.
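The total can be reproduced from the per-operation times quoted above. The sketch works in tenths of a microsecond so the sum stays in exact integers; the names are illustrative.

```python
# (count, time per operation in tenths of a microsecond)
OPERATIONS = {
    "multiply": (9, 252),    # nine global 8-bit multiplications, 25.2 us
    "shift": (12, 24),       # twelve shifts, 2.4 us each
    "add": (9, 49),          # nine 16-bit additions, 4.9 us each
}

TOTAL_TENTHS = sum(count * t for count, t in OPERATIONS.values())

print(f"total: {TOTAL_TENTHS / 10} microseconds")
```

Multiplication dominates the budget, at roughly three quarters of the total time.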

A binary correlation mask using a binary im- 
age is conceptually identical to convolution, but 
bit-wide exclusive-OR operations are worked 
with instead of multiplications. The correlation 
creates a level of comparison, so that a decision 
threshold can be established to determine 
whether a match is close enough to meet system 
requirements. A score of 441 denotes a perfect 
match; 0 indicates an inverse image. Thresh- 


1. EW := RAM 0; C := 0 / Load pattern into E/W registers

2. EW := E / Shift right, then load into E/W

3. NS := EW; EW := E / Load shifted pattern into N/S,
   then shift right and load into E/W

4. EW := RAM 0; C := BW; NS := 0 / Reload original pattern;
   BW is N/S - E/W

5. C := CY

6. RAM 1 := C / Load output drive

3. A simple six-step algorithm allows the array processor
chip to recognize a 101 pattern. When the chip
accepts an "X blank X" (or 101) pattern from a camera
(a), it indicates a match by putting a 1 in the output
plane (b). Every 1 in the output plane indicates the
first X of the pattern.
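The effect of the 101 recognition described in the caption can be sketched at the image level. This shows only the input/output behavior (a 1 in the output plane marks the first X of each "X blank X" run along a row), not the six register-level steps; the function name and the choice to scan along rows are assumptions.

```python
def find_101(image):
    """Mark each position where the row reads 1, 0, 1 starting there."""
    output = []
    for row in image:
        marks = [0] * len(row)
        for c in range(len(row) - 2):
            if row[c] == 1 and row[c + 1] == 0 and row[c + 2] == 1:
                marks[c] = 1
        output.append(marks)
    return output


camera = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
for line in find_101(camera):
    print(line)
```

On the array, the same comparison is made for every cell at once by shifting the image plane past itself, so the cost does not grow with the number of pixels.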





olds can be set at any level in between to deter- 
mine pass or fail. 
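A correlation score of this kind can be sketched as a count of agreeing pixels between the mask and the image window under it. The function names are hypothetical, and counting positions where the exclusive-OR is 0 is an assumption chosen to match the stated scale (441 for a perfect 21-by-21 match, 0 for an inverse image).

```python
def correlation_score(window, mask):
    """Count pixels where window and mask agree (exclusive-OR is 0)."""
    return sum(
        1
        for w_row, m_row in zip(window, mask)
        for w, m in zip(w_row, m_row)
        if (w ^ m) == 0
    )


def passes(window, mask, threshold):
    """Accept the match when the score reaches the decision threshold."""
    return correlation_score(window, mask) >= threshold


SIZE = 21
mask = [[(r + c) % 2 for c in range(SIZE)] for r in range(SIZE)]
inverse = [[1 - bit for bit in row] for row in mask]

print(correlation_score(mask, mask), correlation_score(inverse, mask))
```

A threshold somewhat below the maximum tolerates a few wrong pixels, which is how the decision level trades false rejects against false accepts.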

Processing a 21-by-21-pixel binary cor- 
relation mask takes one exclusive-OR, 1536 
shift operations, and 400 additions. The 
exclusive-OR and shift operations take 300 ns
apiece, and additions take 1.6 µs. Total execution
time is thus 1.3 ms.


Moving pictures 


In an image-processing system, a series of 
chips can be combined to perform various func- 
tions. For example, a Multibus-based setup can 
be built around an arrangement of 48-by-48- 
processor cells that store and manipulate 
image data (Fig. 4). Incoming video data is 
temporarily stored in a row of eight GAPP de- 
vices that serve as a line buffer. After an 8-bit 
analog-to-digital converter processes the im- 
age, the corner-turning row, as it is called, ac- 
cepts an 8-bit gray scale value for each pixel. 

While the camera is horizontally retracing, 
the video data is shifted from the buffer into the 
processing array. The first step of this shift is 
the corner-turning operation, which is per- 
formed by switching data from the EW latch 
into the CM latch of the line buffer. As each 
pixel’s 8 bits are clocked out of the video line 
buffer, they are transferred into the bottom 
row of processing elements, where they are 
stored in the internal RAM at addresses 0 
through 7. At the same time, the previous video 
line is shifted up one row. The entire operation 
requires 18 instructions, well within the array’s 
speed limitations. Operating at a 10-MHz clock, 
the chip can execute as many as 120 instructions
in the 12-µs horizontal retrace period of a
typical camera. Once the entire video frame has 
been loaded into the array, computations can be 
performed. 

The end result of these computations is a pro- 
cessed video frame that is held in the internal 
RAM of the cascaded arrays. Data can be sent 
from the arrays using the same instructions 
that were followed to load it. 

A feature that significantly increases 
throughput is the ability to transmit data and 
load in a new frame concurrently. For instance, 
if frames A and B are loaded into the array, the 
resultant frame, A’, is being computed while a 
third frame, C, is being loaded. Similarly, as A’ 
known as dilation: The image is shifted east,
then an OR operation is performed with the 
original image. Then the image is shifted west 
and ORed, shifted north and ORed, and shifted 
south and ORed. This operation enlarges the 
width of all edges. Individual missing pixels are 
filled in to provide a continuous border. Larger 
gaps can be filled by shifting the image two or 
more units. To determine the true edge, the pro- 
cessor then erodes the image by shifting and 
performing an AND operation, effectively 
eliminating all the blocks it created during 
dilation except for those determined by the al- 
gorithm to be part of the actual image. 
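Dilation by shift-and-OR and erosion by shift-and-AND can be sketched on a binary image as follows. The function names are illustrative, and pixels shifted in from beyond the image border are assumed to be 0, which the article does not specify.

```python
def shift(image, dr, dc):
    """Return a plane where out[r][c] = image[r + dr][c + dc], else 0."""
    rows, cols = len(image), len(image[0])
    return [
        [
            image[r + dr][c + dc]
            if 0 <= r + dr < rows and 0 <= c + dc < cols
            else 0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


NEIGHBOR_SHIFTS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # four unit shifts


def dilate(image):
    """OR the image with copies shifted one unit in each direction."""
    out = [row[:] for row in image]
    for dr, dc in NEIGHBOR_SHIFTS:
        moved = shift(image, dr, dc)
        out = [[a | b for a, b in zip(ra, rb)] for ra, rb in zip(out, moved)]
    return out


def erode(image):
    """AND the image with copies shifted one unit in each direction."""
    out = [row[:] for row in image]
    for dr, dc in NEIGHBOR_SHIFTS:
        moved = shift(image, dr, dc)
        out = [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(out, moved)]
    return out


dot = [[0] * 5 for _ in range(5)]
dot[2][2] = 1
for line in dilate(dot):
    print(line)
```

Running `erode(dilate(image))` fills one-pixel gaps in an edge while returning solid regions to roughly their original width, which is the fill-then-trim sequence the text describes.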

Since an exact match between the object be- 
ing viewed and the image stored in memory is 
rare, convolution and correlation are necessary 
functions to determine how close the match ac- 


tually is. In the simple 101 pattern given earlier, 
perfect matches are simple to demonstrate. In 
the longer strings found in real-world pattern 
recognition, convolution and correlation adjust 
for minor discrepancies. 


Convolution and correlation 


Convolution is employed in edge enhancement,
for instance, to improve the quality of the
image. It also calls on the array’s ability to 
handle global broadcasts. In convolving a 3- 
by-3-pixel mask with an 8-bit gray scale, the 
mask is placed over every pixel in the image and 
the product terms in each 3-by-3-pixel window 
are summed. 

Global broadcasting lets the system send a 
single portion of the 3-by-3-pixel mask to each 
of the cells within the array. To speed pro- 


Table 3. Instruction set for the systolic array processor
into CM


Load zero into CM 


Micro NOP 

Load NS from RAM 
Move from N into NS 
Move from S into NS 
Move from EW into NS 
Move from C into NS 
Load 0 into NS 


Micro NOP 

Load EW from RAM 
Move from E into EW 
Move from W into EW 
Move from NS into EW 
Move from C into EW 
Load 0 into EW 


Micro NOP 

Load C from RAM 
Move from NS into C 
Move from EW into C 
Load C from Carry 
Load C from Borrow
Load 0 into C 

Load 1 into C 


Read from RAM 
Load RAM from CM
Load RAM from C 
Load RAM from Sum 
Handling real-time images
comes naturally 
to systolic array chip 


The internal memory and specialized algorithms 
of a systolic array IC cut the amount of hardware 
and boost the speed associated with image processing. 





This is the second in a series focusing on the 
first commercial systolic array processor chip, 
developed by NCR Corp.’s Microelectronics 
Division in Fort Collins, Colo. The opening 
article was the Oct. 31 cover story (p. 207). 
Upcoming discussions will investigate the de- 
vice’s use in pattern recognition, data-base 
management, and as an associative processor. 


Until recently, real-time image processing
has been a difficult task, calling for a
large amount of hardware. Most high-performance
systems comprise a frame buffer,
which stores the incoming image; a high-speed,
pipelined processor to carry out the needed
algebraic manipulations; and a second buffer to
retain the processed image. Although interleaved
sequential memory accesses in such
setups make it possible to load and unload the
buffers rapidly, the bandwidth of the memory-processor
bus limits throughput. Furthermore,
some image-processing algorithms require
several fetches for each pixel, further cutting
into overall system speed.

The Geometric Arithmetic Paralle! Pro- 
cessor (GAPP) chip overcomes these obstacles 
by supplying an array of 72 parallel bit-serial 
processor elements, each of which is fitted with 
128 bits of RAM. This configuration lets system 
designers dedicate an individual processor ele- 
ment to every pixel. To cut costs, though, many 
systems could handle smal! groups of pixels or 
subimages serially, assigning more than one 
pixel to a processor element, or cell. In fact, the 
systolic array can be viewed as a combined 
frame buffer and processor, bringing a bit-
mapped image into its RAM, processing it,
and then putting it back in RAM before sending 
it out. One example of the chip’s prowess is its 
ability to store two images in its RAM and then 
deliver the difference between them. For design 
considerations, the monolithic array can also 
be considered a highly pipelined, parallel 
processor. 

Since the chip departs substantially from the 
conventional von Neumann architecture, 
image-processing systems based on it must 
vary from the usual as well. To demonstrate 
these differences, it is necessary to briefly ex- 
amine the traditional approaches. One, for in- 
is being sent, the next frame can be loaded. In
this case, once three frames are loaded, real- 
time pipelined processing is obtained. The set of 
three 16-bit latches multiplexed onto the Multi- 
bus board also lets the host exchange data with 
the system. Finally, information can also be de- 
livered to the line buffer and sent to a video 
monitor using a d-a converter. 

The implementation of a control store lets 
the arrays receive a set of instructions from the 
host and store them, freeing the host for other 
tasks. The store operates in conjunction with a 
sequencer that watches for and maintains the 
correct sequence as the arrays perform their in- 
structions. 

A system such as this can also be imple- 
mented as a workstation for developing 
software meant for the processor chip. An
upcoming software simulator and assembler,
running in conjunction with the workstation, 
will allow users to load input data into a group 
of arrays, run through a sequence of instruc- 
tions, and transmit the results back to the host 
computer. Additionally, a software library of 
macrocells will form the basis of a high-level 
command set for the processor.
4. A single-board system can be built around two blocks of array processor chips. One block
serves as the processing unit; the second, as the line buffer. The control store retains the com- 
mands from the host, freeing it to carry out other tasks. 
since these chips can compute while handling
the serial-to-parallel shift. Regardless of 
whether systolic arrays are used, the chip’s 
memory associated with each processor ele- 
ment allows it to simultaneously store up to 16 
images of 8 bits each, obviating the need for 
frame buffers. 


Quicker than the eye 


Once the architecture of the image-process- 
ing system is selected, the next concern is decid- 
ing on the number of systolic array chips (see 
“Welcoming Aboard the Systolic Array,” 
p. 293). When speed is the primary concern, a 
one-to-one relationship between processor ele- 
ments and pixels can be established. A block of 
512 by 512 processor elements, made up of about 
3700 chips, can perform 100 billion 8-bit addi- 
tions a second. In the thirtieth of a second it 
takes to bring in a typical television frame, 
every cell can execute 13,333 8-bit additions or 
333,333 primitive single-cycle instructions, far
more than the number demanded by many real-
time image-processing algorithms (see
“Systolically Altered States,” p. 294). 
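
The figures quoted above follow from a few constants given in the article: 72 processor elements per chip, a 10-MHz clock, 25 cycles per 8-bit addition, and 30 frames a second. The short Python sketch below reproduces that arithmetic. The function names are illustrative only and are not part of any NCR tool.

```python
import math

PES_PER_CHIP = 72          # processor elements per chip (from the article)
CLOCK_HZ = 10_000_000      # 10-MHz clock
CYCLES_PER_ADD = 25        # cycles for one 8-bit addition
FRAME_RATE = 30            # television frames per second


def chips_needed(rows, cols):
    """Chips required to give every pixel of a rows-by-cols block its own cell."""
    return math.ceil(rows * cols / PES_PER_CHIP)


def adds_per_second(rows, cols):
    """8-bit additions per second when every cell adds continuously."""
    return rows * cols * CLOCK_HZ // CYCLES_PER_ADD


def per_frame_budget():
    """(single-cycle instructions, 8-bit additions) available to each cell per frame."""
    cycles = CLOCK_HZ // FRAME_RATE
    return cycles, cycles // CYCLES_PER_ADD


if __name__ == "__main__":
    print(chips_needed(512, 512))      # chips for a 512-by-512 block
    print(adds_per_second(512, 512))   # roughly 100 billion
    print(per_frame_budget())
```

The same helpers also confirm the smaller configurations mentioned elsewhere in the article, such as 32 chips for a 48-by-48 grid and 172 chips for a 24-by-516 block.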

Thus instead of a simple 1:1 ratio between 
processor elements and pixels, a system might 
dedicate one element to a number of pixels and 
thus process data in the form of windows. When 
one window is completed, processing can begin 
on the next. 


Beat the clock 


In a system involving a real-time algorithm,
which does not require the use of previous im- 
age frames, the entire 512-by-512-pixel image 
need not be in the GAPP array all at once, 
thereby cutting the number of devices required. 
In a so-called neighborhood processing algo- 
rithm—one that determines the next value of a 
pixel by comparing it with the pixels surround- 
ing it—a block of 24 by 516 processor elements, 
consisting of 172 systolic devices, can carry out 
600 additions on every pixel while operating at 
10 MHz—far more processing power than 
available with conventional architectures.

Since less hardware is used, the necessary 
program may be larger and more complex than 
that found in architectures devoting one pro- 
cessor cell to every pixel. Despite such differ- 
ences, the algorithms share many attributes. In
this setup, each pixel is stored in internal RAM,
and although it might first appear that 128 bits 
of image data can be held in memory, the need 
to retain operands and intermediate results 
and to flag overflows reduces the chip’s capaci- 
ty somewhat. As in the first configuration, the 
number of systolic devices can be boosted or cut. 


A different point of view 


Programming the systolic array is radically 
different from programming a traditional mi-
croprocessor. The first is a single-instruction,
multiple-data path (SIMD) machine; the sec- 
ond, a single-instruction, single-data path 
(SISD) device. For that reason, code for an ex- 
isting chip cannot simply be converted: Writing 
software for the systolic chip demands a new 
way of looking at both the task and the neces- 
sary algorithm. 

To facilitate programming the systolic pro- 
cessor, a simulator that runs on personal com- 
puters has been created. Written in C, the soft-
ware runs under Unix and operates on NCR’s PC-4
and on the IBM PC XT as well as on larger
systems like the Digital Equipment PDP-11 




2. A video line buffer, which stores a full input line
from the camera, can be made up of either shift reg-
isters or systolic array chips. The 128 bits of RAM
included for each processor element in the GAPP
block eliminate the need for frame buffers.
stance, relies on a pipelined ALU, with separate
frame buffers for input and output. Pipelining 
joins a series of processor elements to perform 
sequential arithmetic operations on a continu- 
ous data stream. The method is good with pro- 
cessors that range from bit-slice devices to 
supercomputers. Nonetheless, even the latter 
can perform only from 20 to 100 operations on 
each pixel to sustain a real-time rate of 10 mega- 
pixels/s, the rate of standard video systems. 

The systolic array can drop into such an ar- 
chitecture (Fig. 1). With 32 of the chips joined 
together to create a grid of 48 by 48 processor 
elements totaling 2304 processors, up to 60 mil- 
lion pixels/s can be accepted, even with a gray- 
scale depth of 8 bits a pixel. Since data can be 
loaded over the chip’s communication (CM) bus 
at the same time that it is processed, the grid 
array can operate at full speed at all times, 
chewing up 920 million macroinstructions ev- 
ery second. (A macroinstruction is defined here 
as an 8-bit addition that can be executed in 25 
cycles, or 2.5 µs.) Linking together more chips
further increases processing power. 

Despite its impressive speed, the architec- 
ture is not optimal for the systolic processor be- 
cause data must be reformatted to work with 
the array. The chip works with information in 
the form of bit planes. As a result, an 8-bit num-
ber representing the pixels must first be re-
formatted as a bit plane. The first bit plane rep- 
resents the least-significant bits. Once in the 
array, the whole plane is written to one location 


within the internal RAM of each processor ele- 
ment. The next seven bits must be loaded simi-
larly, but such reformatting is too complex for
most frame buffers. 
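
To make the bit-plane format concrete, the following Python sketch splits an image of 8-bit pixels into eight planes, least-significant plane first, and reassembles it. It is a minimal illustration of the data layout only. The helper names are hypothetical, and the hardware performs this reformatting with shift registers, not software.

```python
def to_bit_planes(image, bits=8):
    """Split a 2-D list of integer pixels into `bits` planes, LSB plane first."""
    return [[[(pixel >> b) & 1 for pixel in row] for row in image]
            for b in range(bits)]


def from_bit_planes(planes):
    """Rebuild integer pixels from a list of bit planes (LSB plane first)."""
    rows, cols = len(planes[0]), len(planes[0][0])
    return [[sum(planes[b][r][c] << b for b in range(len(planes)))
             for c in range(cols)]
            for r in range(rows)]


if __name__ == "__main__":
    image = [[0, 1, 200], [255, 44, 19]]
    planes = to_bit_planes(image)
    print(planes[0])                 # the least-significant bit plane
    print(from_bit_planes(planes))   # the original image again
```

Each plane corresponds to one RAM address shared by every processor element, which is why an 8-bit image occupies eight of the 128 addresses.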


Shifting into first 


To overcome this hurdle, a designer can turn 
to serial-to-parallel shift registers long enough 
to store one full video line (Fig. 2). During the 
horizontal retracing period of the television
signal, the previous video line is shifted into the 
edge of systolic arrays, which can consist of any 
number of chips. The least significant bit of 
each pixel in the line is shifted into the bottom 
row of processor elements and written into 
RAM address 1. The next most significant bit is 
then shifted in and written to RAM address 2. 
The process continues until all eight bits of ev-
ery pixel line have been loaded into RAM ad-
dresses 1 through 8 of the bottom row of pro- 
cessor elements. 

Each RAM location of the block is read into 
the CM register before each shift into CM from 
the south (CM=CMS), so that the first video 
line is shifted up and written into the adjacent 
row of processor elements when the second line 
enters the bottom row of processor elements. 
Once the grid is filled, the same process occurs 
as the image is unloaded to the north and sent to 
the output video line buffer. 
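
The loading scheme can be modeled in a few lines of Python. In this sketch, which assumes a simple list-of-lists model of the grid and is not NCR's simulator, each cell's RAM is a list of 128 bits and row 0 is the north edge. For every bit of the incoming line, all cells copy the addressed RAM bit to CM, CM shifts in from the south, and the result is written back, so earlier lines climb one row as each new line enters.

```python
def load_line(ram, line, bits=8):
    """Shift one video line into the bottom row; lines already held move north.

    ram[r][c] is the list of RAM bits for the cell in row r (0 = north edge).
    """
    rows, cols = len(ram), len(ram[0])
    for b in range(bits):
        addr = b + 1                                   # LSB goes to RAM address 1
        # CM := RAM(addr) in every cell
        cm = [[ram[r][c][addr] for c in range(cols)] for r in range(rows)]
        # CM := CMS, each cell takes the value of its southern neighbor;
        # the bottom row takes bit b of the incoming line
        incoming = [(pixel >> b) & 1 for pixel in line]
        cm = cm[1:] + [incoming]
        # RAM(addr) := CM
        for r in range(rows):
            for c in range(cols):
                ram[r][c][addr] = cm[r][c]


def read_pixel(cell, bits=8):
    """Reassemble the pixel held in RAM addresses 1 through `bits` of one cell."""
    return sum(cell[b + 1] << b for b in range(bits))


if __name__ == "__main__":
    grid = [[[0] * 128 for _ in range(4)] for _ in range(3)]
    for video_line in ([1, 2, 3, 4], [10, 20, 30, 40], [100, 150, 200, 250]):
        load_line(grid, video_line)
    print([[read_pixel(cell) for cell in row] for row in grid])
```

After as many lines as there are rows have been loaded, the first line sits in the top row and the grid holds the image in its natural orientation.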

The line buffers can be designed with either 
shift registers or with systolic array devices. 
The latter approach enhances performance, 
1. A Geometric Arithmetic Parallel Processor can be substituted for traditional mi-
croprocessors in a pipelined architecture. The arrangement requires the memory 
to be very wide, and data to be reorganized. It is thus better to reconfigure the 
architecture to take advantage of the chip’s properties. 
Welcoming aboard the systolic processor

Since the Geometric Arithmetic Parallel Processor
differs so radically from traditional processors, a
number of aspects of design must be considered when
a pc board is laid out. Foremost among these are the
communications lines that join a block of systolic
processors.

No support circuitry is called for between the
chips, which themselves are easily linked to their
neighbors to the north, south, east, and west. In that,
they resemble an individual processing element
within a single array, which is joined to its four
nearest neighbors. Further, the 84-contact packages
are readily connected since the North output of one
IC is physically adjacent to the South output of an
adjoining chip. The East and West ports are similarly
compatible.

Terminating the outer edges of a block of arrays
demands a variety of techniques, depending upon
which algorithm is being executed. That presents no
problem, though, since a programmable multiplexer
can switch from one termination technique to
another, under software control.

On the one hand, the edge connections can be
grounded during input cycles so that all shifts bring
in zeros from the outer edge of the block. Alternatively,
the edges may be tied to a data bus for I/O.
A third approach brings the connections from the
east and north around to those of the west and
south, respectively, so that data is recycled.

These connections can be made without concern
for loading and fan-out, since they involve only the
processor elements at the edge of the group of chips.
Control, address, and clock signals, however, must be bused to
each device in a grid of chips. In wraparound lay- 
outs, synchronization is critical between the clock 
and control lines at the edges of the block. 

When large blocks of the chips are grouped to- 
gether, it is generally best to drive them in groups 
of less than 40 chips. Driving more chips can skew 
timing and may exceed the power capabilities of 
driver chips. The routing for this type of bus is 
best laid out using an H-shaped topology (see the 
figure). 

When a number of chips are being clocked syn- 
chronously and driven in parallel by command 
drivers, power distribution must be uniform. There- 
fore, boards using wire-wrapped interconnections 
should have full surface power and ground planes. 
Inattention to the capacitive details of coupling and 
ground planes can cause undershoot and overshoot 
of signals. To supply a new control word every 100
ns, keeping pace with the device's 10-MHz clock, a 
20-bit-wide instruction queue for both the control 
and address lines is needed. Most designs, however, 
should include 24 or more extra bits to ensure space 
for control functions and looping. Static RAMs are 
the simplest to use for this; however, for high speed 
2k-by-8-bit RAMs are preferred. 

The instruction 
queue in a system 
based on a systolic ar- 
ray is driven by high- 
speed address sequen- 
ces. The four extra 
bits in a 24-bit-wide 
instruction queue can 
be used to control 
jumps and loops of an 
address sequencer. 
The Global Output 
signal from the array 
can serve as a flag for 
conditional jumps. 
and NCR’s Tower 1632.

Although the advantages of simulating oper- 
ation while the hardware is being designed are 
obvious, it must be noted that running the array 
program on a single-instruction, single-data- 
path computer will be very slow. A task exe- 
cuted as a single instruction on a systolic array 
will require at least N² operations when it runs
on a conventional processor, where N equals the
number of processor elements along one axis of 
the array. 

Consider the addition of two 8-bit, 512-by- 
512-pixel images. A 10-MHz, 8-bit processor 
needs at least 1 second to do the job. As men-
tioned earlier, a grid of 512 by 512 processor elements
could perform the same function in 25 cycles, or
about 2.5 µs.


Breaking with convention 


Another factor that must be considered when 
the simulator runs on a traditional computer is 
the relationship between the memory and a 
processor. A conventional processor passes 
data between itself and memory. The systolic 
array, in contrast, has the aforementioned 128 
bits of RAM associated with each cell, and every 
memory address holds one bit. Consequently, to 
speed the simulator’s operation while simplify- 
ing the development of algorithms, the size of 
the grid should generally be kept down to 6 by 6 
or 12 by 12 processor elements. Fortunately, 
software written for a small array can run 
without modification on larger arrays. 

Once a satisfactory set of algorithms for a 
particular job is complete, an assembler con- 
verts the mnemonic code created by the simula- 
tor into binary instructions for the target ma- 
chine. The assembler produces a binary file 
that can be loaded into a high-speed, 20-bit- 
wide memory, dubbed the instruction queue. 
The queue holds the algorithm for execution at 
the frame rate selected for the system. In a real-
time system, say, data comes in and processed
data goes out simultaneously. As the algo- 
rithms run, a complete loop through the in- 
struction queue is repeated for every new frame 
passing through the grid. 

The kinds of algorithms that must be devel- 
oped for image processing are, of course, direct- 
ly tied to both the specific demands of such 
processing and to the way the array works. 


Image-processing computations are more dis- 
tinctly parallel than those of scientific and 
business calculations, in which memory use and 
the operations performed are far more random. 

The speed with which the systolic array han- 
dles such parallel chores can be clearly seen by 
again comparing the array to a traditional pro- 
cessor. A von Neumann machine requires on 
the order of N × N cycles to process an N × N
pixel image. That interval is expressed as
O(N²), which is short for “order N squared”. The
systolic array needs only O(N), or even O(k) 
cycles, where k equals either the number of 
bits per pixel or the number of digits used in the 
calculation, to process the same image. 

When the array processes an image, each 
element is active simultaneously, so the time 
needed to subtract one image from another is 
independent of the size of the image. Algo- 
rithms for the primitive operations of image 
processing—adding and translating an image 
along an axis and manipulating the gray scale 
—can be performed in O(k) time. Furthermore, 
operations that normally occur within the indi- 
vidual registers of a von Neumann processor 
(bit inversion, bit setting or resetting, and bit 
shifting) are easily handled in O(k) cycles by 
the systolic array. 
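
The O(k) behavior is easiest to see in a bit-serial addition. The Python sketch below, a model written for this discussion and not taken from the chip's instruction set, adds two images one bit plane at a time. The outer loop runs k times whatever the image size; the two inner loops stand in for work that every processor element would do simultaneously.

```python
def add_bit_serial(a, b, k=8):
    """Add two equal-sized images bit by bit; return (k-bit sums, overflow plane)."""
    rows, cols = len(a), len(a[0])
    carry = [[0] * cols for _ in range(rows)]
    total = [[0] * cols for _ in range(rows)]
    for bit in range(k):                    # k steps, independent of image size
        for r in range(rows):               # on the array, all cells act at once
            for c in range(cols):
                x = (a[r][c] >> bit) & 1
                y = (b[r][c] >> bit) & 1
                s = x ^ y ^ carry[r][c]     # sum bit of a full adder
                carry[r][c] = (x & y) | (carry[r][c] & (x ^ y))
                total[r][c] |= s << bit
    return total, carry                     # the final carry is the overflow plane


if __name__ == "__main__":
    sums, overflow = add_bit_serial([[1, 200], [255, 19]], [[2, 100], [1, 44]])
    print(sums)
    print(overflow)
```

The overflow plane returned here plays the role of the 1-bit overflow field that the article attaches to each result field.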


Nothing to it 


Other algorithms handled just as readily by 
the device are those requiring information 
about the four or eight neighboring pixels. A 
4-neighborhood algorithm can be defined as 
one using the north, south, east, and west pro- 
cessing elements of a particular portion of an 
image. The eight-pixel neighborhood consists 
of those four plus the northeast, northwest, 
southeast, and southwest cells. Such algo- 
rithms include 3-by-3-pixel convolution, a 
3-by-3-block pattern matching, and various 
types of erosion and dilation. All of these are
classified as local algorithms, since they do not 
require information from any elements other 
than their immediate neighbors. 

Global algorithms, on the other hand, like 
histograms and correlations, need information 
from more distant elements. They take O(N) 
time, much faster than the time demanded by a 
traditional computer. 

Certain fundamental operations are common 


ations. Thresholding determines which pixel 
values are greater or less than a predetermined 
level. In an application that needs to zero (that 
is, ignore or turn into zeros) all the pixels with a
gray-scale value of less than 20, the first step is 
to make a copy of the image’s data base, which 
is destroyed as the task is carried out. 

Since a 6-bit field can represent numbers 
from 0 to 63, adding 44 to every pixel will cause 
all those with values greater than 19 to over- 
flow. The overflow bit plane must then be in- 
verted to yield a zero overflow bit in every pixel 
where that occurred. If the inverted overflow 
bit is then ANDed with the original fields, all 
the pixels that overflowed will have their fields 
zeroed. 
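
The overflow trick can be checked with a small Python model. This sketch assumes 6-bit pixels and a cutoff of 20, as in the example, and uses hypothetical helper names. It builds the overflow plane by adding 44 and then shows the effect of ANDing the image with either the plane or its inverse: the plane itself keeps only pixels of 20 and above, while the inverted plane zeroes exactly the pixels that overflowed.

```python
def overflow_plane(image, cutoff=20, bits=6):
    """1 where adding (2**bits - cutoff) carries out of a `bits`-wide field."""
    addend = (1 << bits) - cutoff            # 44 for a 6-bit field and cutoff 20
    return [[((p + addend) >> bits) & 1 for p in row] for row in image]


def invert(mask):
    """Complement every bit of a one-bit plane."""
    return [[1 - m for m in row] for row in mask]


def and_with_mask(image, mask):
    """AND every bit of each pixel with that pixel's mask bit."""
    return [[p if m else 0 for p, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


if __name__ == "__main__":
    image = [[0, 19, 20], [44, 63, 5]]
    plane = overflow_plane(image)
    print(and_with_mask(image, plane))           # pixels below 20 are zeroed
    print(and_with_mask(image, invert(plane)))   # pixels that overflowed are zeroed
```

Which of the two masks to apply depends on whether the application wants to discard the dim pixels or the bright ones.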

The entire task can be rapidly performed by 
using a global broadcast, which simultaneously 
places a given value (in this case, 44) in every 
processor element in the array. Obviously that 
is faster than moving the data through the ar- 
ray until it has reached each processor element. 
To place the binary value 101100 into RAM lo- 
cations 21 to 26 in every element, the following 
instruction sequence would be executed: 


C:=1
RAM26:=C, C:=0
RAM25:=C, C:=1
RAM24:=C, C:=1
RAM23:=C, C:=0
RAM22:=C, C:=0
RAM21:=C
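
A global broadcast amounts to the controller setting the C register to the same constant in every cell and writing it to one RAM address per instruction. The Python sketch below models that behavior on a list-of-lists grid; the `broadcast` function and its parameters are assumptions made for illustration.

```python
def broadcast(ram, value, lsb_addr, bits):
    """Write `value` into RAM addresses lsb_addr .. lsb_addr+bits-1 of every cell."""
    for b in reversed(range(bits)):          # most-significant bit first
        c = (value >> b) & 1                 # C := 0 or 1, identical in every cell
        for row in ram:
            for cell in row:
                cell[lsb_addr + b] = c       # RAM(lsb_addr + b) := C


if __name__ == "__main__":
    grid = [[[0] * 128 for _ in range(3)] for _ in range(2)]
    broadcast(grid, 0b101100, 21, 6)         # the value 44 into RAM 21 to 26
    print(grid[0][0][21:27])                 # RAM 21 (LSB) through RAM 26 (MSB)
```

Because no data moves between neighbors, the cost is one instruction per bit regardless of how many chips make up the grid.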


Another chore common to image processing, 
finding the maximum pixel value in an image, 
lends itself to the architecture of the systolic 
array. A number of algorithms could be used, 
depending on the desired objective. One takes 
advantage of the chip’s Global Output (GO) line 
to furnish the value of the highest-intensity 
pixel (MAXVAL) within an O(k) interval (Pro-
gram 1).


Once the algorithm is completed, the pro-
cessor elements with the maximum intensity
value will have a logic 1 stored in their EW reg- 
isters. The same algorithm can also determine 
the value of the lowest-intensity pixel (MINVAL) 
by first making a negative from the image, 
which is accomplished by simply inverting each 
bit of the pixel. 
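
The idea behind the maximum search can be sketched in Python. Every cell starts as a candidate; the bits are examined from most significant to least, and a global OR over the candidates decides each bit of MAXVAL. The `any()` call below is only a stand-in for the chip's Global Output line, whose electrical polarity is not modeled here, and the function name is hypothetical.

```python
def find_max(image, k=8):
    """Return (MAXVAL, candidate plane) using k bit-serial steps."""
    rows, cols = len(image), len(image[0])
    ew = [[1] * cols for _ in range(rows)]            # every cell is a candidate
    maxval = 0
    for bit in reversed(range(k)):                    # MSB down to LSB
        # NS := RAM bit AND EW, in every cell
        ns = [[((image[r][c] >> bit) & 1) & ew[r][c] for c in range(cols)]
              for r in range(rows)]
        if any(any(row) for row in ns):               # global OR of NS
            maxval |= 1 << bit                        # this bit of MAXVAL is 1
            ew = ns                                   # cells lacking the bit drop out
        # otherwise the bit is 0 and EW keeps its present value
    return maxval, ew


if __name__ == "__main__":
    value, where = find_max([[3, 200, 17], [200, 199, 0]])
    print(value)
    print(where)       # 1 marks each cell holding the maximum
```

Running the same routine on the bit-inverted image and inverting the result gives the minimum, as the article notes.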

In some instances, it is desirable to determine 
the location of the highest-intensity pixels. The 
only additions needed are a bit detector (a sim- 
ple comparator) and another algorithm (Fig. 3). 
The comparator simply accepts inputs from the 
array until a logic 1 is picked up. It then sends 


for m = 1 to M do
    for n = 1 to N do
    {
        EW:=E
        if bit_detect = 1
            pixel_location := m,n
    }

Controller: on interrupt, send m and n to the host



3. By running a specific algorithm, a comparator 
serving as a bit detector can determine the location 
of pixels with the greatest gray-scale values. When a 
logic 1, which denotes such pixels, is observed, the 
controller is interrupted and sends the location of 
the bit to the host. 
to both local and global algorithms. One such
operation, or building block, is overflow detec- 
tion, which is used for many tasks. 

One approach to it conjoins a 1-bit field with 
each field to be operated upon. Adding a field of 
3 bits and a field of 5 bits will probably cause an 
overflow if it is delivered to a 3-bit field, so a 1
will be placed into the overflow field. The result- 
ant image provides useful information about 
the data being processed. For instance, the 
overflow bit may be used to generate a visual
cue, like light or dark spots on the screen, to in-
dicate which elements have overflowed. It can 
be used to interactively adjust the algorithm. 

Among the other operations necessary for 
image processing are common arithmetic func- 
tions like addition, subtraction, and multi- 
plication. Generally, images consist only of pos- 
itive numbers representing the gray-scale 
value of the pixel. Image multiplication is need- 
ed for windowing or masking. A two-dimen- 
sional template representing a window may be 
shifted into the array and multiplied by the 
resident image. Any of these arithmetic oper- 
ations may cause an overflow, which will be in- 
dicated if an overflow bit plane is used in the re- 
sult field. 

Register shifting is taken care of in the same 


manner as moving a contiguous section of 
memory on a standard machine. To shift up- 
ward in memory index, the highest numbered 
element in the block is shifted first, followed by 
the second highest, the third, and so on. Once 
again, overflow detection is needed to deter- 
mine whether an element is shifted out of its 
field, since the program cannot write outside 
the field. 

Translation, another basic operation, is one 
of the simplest for the chip to handle because of 
the relationship between neighboring pro- 
cessors. To shift toward the east a 1-bit field 
located at RAM address 12 within the processor 
array, simply execute: 

EW:=RAM12
EW:=W
C:=EW
RAM12:=C


Here, overflow detection is not needed, since 
there is no possibility of an overflow taking 
place. 
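
A Python model of the eastward translation is given below as a sketch under simple assumptions: the grid is a list of rows, each cell's RAM is a list of bits, and the bit entering at the west edge is supplied by the `edge` argument (zero when the edge inputs are grounded). The function name is illustrative.

```python
def translate_east(ram, addr=12, edge=0):
    """Move the 1-bit field at RAM `addr` one cell to the east in every row."""
    for row in ram:
        ew = [cell[addr] for cell in row]     # EW := RAM(addr)
        ew = [edge] + ew[:-1]                 # EW := W, take the western neighbor's bit
        for cell, bit in zip(row, ew):
            cell[addr] = bit                  # write the shifted bit back to RAM(addr)


if __name__ == "__main__":
    grid = [[[0] * 128 for _ in range(3)] for _ in range(1)]
    for col, bit in enumerate([1, 0, 1]):
        grid[0][col][12] = bit
    translate_east(grid)
    print([cell[12] for cell in grid[0]])     # the pattern has moved one cell east
```

Translating a k-bit field simply repeats these steps for each of its RAM addresses.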


Back to basics 


One basic task of image processing, thresh- 
olding, unites a number of the foregoing oper- 


Program 1. Establishing the highest-intensity pixels

COMMENT: initialize EW = 1
NS:=0, EW:=0, C:=1
NS:=0, EW:=C, C:=0
COMMENT: Loop from MSB to LSB and deliver MAXVAL as bit-serial output on GO
for n = 8 to 1 do
{
    NS:=RAMn, EW:=EW, C:=0      (Read next bit from RAM into NS)
    NS:=NS, EW:=EW, C:=CY       (Form NS “and” EW)
    NS:=C, EW:=EW, C:=0         (Send result to GO from NS)
    if GO=1
    {
        EW:=EW                  (Bit n of MAXVAL = 0 from NS)
                                (EW retains present value)
    }
    if GO=0
    {
                                (Bit n of MAXVAL = 1)
                                (EW set to 0)
    }
}
Program 2. Binary-tree summing

for m = 0 to 5 do
{
    C:=0
    for n = 1 to (8+m) do
    {
        NS:=RAMn, EW:=EW, C:=C
        NS:=NS, EW:=RAMn, C:=C
        for p = 1 to 2^m
        {
            NS:=S, EW:=EW, C:=C
        }
        RAMn:=SM, C:=CY
    }
    RAM(m+9):=CY
}





Program 3. Sorting pixels into bins

NS:=0, EW:=0, C:=1
NS:=0, EW:=0, C:=1, RAM0:=C                (Initialize RAM 0 = 1)
for n = 1 to 6 do
{
                                           (Broadcast bin bit n)
    NS:=0, EW:=0, C:=X                     (Where X is the value of bin bit n)
    NS:=RAMn, EW:=C, C:=1                  (Read bit n of image pixel)
    NS:=RAM127, EW:=EW, C:=1, RAM127:=SM   (SM = 1 if NS matches EW)
    NS:=NS, EW:=RAM0, C:=0                 (Read RAM 0 and compare with RAM 127)
    NS:=NS, EW:=EW, C:=CY                  (CY = 1 if RAM 0 and RAM 127 were both 1)
    NS:=NS, EW:=EW, C:=1                   (If all six bits match, then RAM 0 will continue to contain 1)
}
an interrupt to the controller, which locates the
highest-intensity pixels by counting the num- 
ber of zeros that preceded them. 


Stand and be counted 


Counting the number of pixels that are 
displayed at maximum intensity is also done 
relatively simply and quickly with the array. 
Traditional processors would take O(N × N)
operations, but an array-based binary tree ap- 
proach performs a number of additions in par- 
allel, hence requiring only O(log N) operations. 
Several pairs of numbers are added within all 
columns of an array, then pairs of these results 
are added in parallel. The resulting data flows 
upward through the block of arrays until the 
sum reaches the top processor element of each 
column. 

At that point, a second algorithm sums the 
values in the rows until the total for the entire
block is contained in the upper-left-hand pro- 
cessor element. Since translation operations 
cause data to shift into the edge of the array, 
these inputs must be set to zero so that the ex-
ternal data contributes zero to the sum. A
binary-tree summation of a column of 64 num- 
bers first assumes that the numbers are 8-bit 
pixel values. They are also assumed to reside in 
RAM locations 1 for the LSB to 8 for the MSB 
(Program 2). The partial sums are stored in 
RAM locations 1 through 14. 
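
The binary-tree pattern can be followed in a short Python model of one column. At step m every cell adds the value held 2^m cells to its south, with zeros entering at the south edge, so after six steps the top cell of a 64-cell column holds the full sum. This is a sketch of the data movement only; it works on whole integers where the chip would add bit-serially, and the function name is an assumption.

```python
def column_tree_sum(column):
    """Sum a column whose length is a power of two; column[0] is the top cell.

    Returns (total held by the top cell, number of doubling steps).
    """
    values = list(column)
    distance, steps = 1, 0
    while distance < len(values):
        # Data moves north by `distance` cells; zeros enter from the south edge.
        shifted = values[distance:] + [0] * distance
        values = [v + s for v, s in zip(values, shifted)]   # every cell adds at once
        distance *= 2
        steps += 1
    return values[0], steps


if __name__ == "__main__":
    print(column_tree_sum(list(range(64))))
    print(column_tree_sum([255] * 64))    # the largest possible sum of 8-bit pixels
```

The worst-case total of 64 pixels at 255 is 16,320, which fits in 14 bits and accounts for the 14 RAM locations reserved for partial sums.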


Straightforward convolutions 


Convolution is one of the most important jobs 
performed in image processing. It uses the pre- 
viously described neighborhood algorithm to 
determine new values for pixels, thereby en- 
hancing an image. Convolutions are put to work 
along the entire range of image processing, 
from upgrading old photographs to improving 
the definition of edges in a robotic vision 
system. 

Convolution is characterized by a high level 
of parallelism, so it is well suited to the systolic 


array. Typically, a template of new values is 
placed over the values of the camera image. 
Global broadcasting distributes the template. 
The objective is to move the sum outward in a
spiral from the center of the template, which is 
the location of the new pixel value, to each of 
the matrix elements that reside under the 
template. At each matrix location a multipli-
cation is performed, and the result is added to a
traveling sum. The image resulting from this 
convolution is enhanced. Since all of the sum- 
mations occur simultaneously, the parallel 
array processor handles the job at a good clip. 
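
A 3-by-3 template operation can be sketched in Python as nine broadcast-multiply-accumulate steps. Note the substitution: where the article describes a traveling sum that spirals outward, this sketch shifts a copy of the image toward each cell and accumulates in place, which yields the same weighted neighborhood sum and is easier to follow. Zeros are assumed to enter at the edges, and the function names are illustrative.

```python
def shift(plane, dr, dc):
    """Return `plane` moved by (dr, dc) cells, with zeros entering at the edges."""
    rows, cols = len(plane), len(plane[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            sr, sc = r - dr, c - dc
            if 0 <= sr < rows and 0 <= sc < cols:
                out[r][c] = plane[sr][sc]
    return out


def apply_template(image, template):
    """Lay a 3-by-3 template over every pixel and sum the weighted neighbors."""
    rows, cols = len(image), len(image[0])
    acc = [[0] * cols for _ in range(rows)]
    for i in range(3):
        for j in range(3):
            weight = template[i][j]                 # broadcast to every cell
            moved = shift(image, 1 - i, 1 - j)      # bring that neighbor to each cell
            for r in range(rows):
                for c in range(cols):
                    acc[r][c] += weight * moved[r][c]
    return acc


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(apply_template(image, [[1, 1, 1], [1, 1, 1], [1, 1, 1]]))
```

The number of steps depends only on the template size, not on the size of the image, which is the property the article credits for the array's speed.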

Histograms, which count the number of pix- 
els containing particular gray-scale values, can 
make adjustments for changes in lighting, as 
well as let systems adjust to very light or very 
dark images. In that way they improve visual 
information at either end of the intensity 
spectrum. 

The process is handled as quickly as the 
array’s global-sum operation counts the pro- 
cessor elements. The elements to be counted are 
first identified by broadcasting a gray-scale 
value to every processor element and com- 
paring it with the pixel value stored in each. 
Matches to the image stored in RAM locations 1 
to 6 are determined by using a specific algo- 
rithm (Program 3). Various values are broad- 
cast to create series of “bins,” with different 
pixel levels sorted into the appropriate bins. 

After this task is finished, every processor 
element that holds a pixel matching the broad- 
cast pixel will have a logic 1 in RAM location 0. 
Before counting the number of pixels, a quick 
check for GO= 1 will indicate if there were any 
pixels at all which matched the broadcast value. 
By determining the number of pixels in the var- 
ious bins, the system can figure out whether the 
image is dark or light or contains a variety of 
shades, making adjustments as necessary.
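
The bin-matching step can be modeled in Python as follows. For each broadcast gray-scale value, a flag per cell starts at 1 and is ANDed with a bit-by-bit equality test, so the flag survives only where all six bits agree. Counting the flags for every value gives the histogram. This is a sketch of the logic with hypothetical function names; on the array the count would come from the global-sum operation.

```python
def match_plane(image, value, bits=6):
    """1 in every cell whose pixel equals `value`, compared one bit at a time."""
    flags = [[1] * len(row) for row in image]         # the flag starts at 1
    for b in range(bits):
        x = (value >> b) & 1                          # broadcast bin bit
        for r, row in enumerate(image):
            for c, pixel in enumerate(row):
                same = 1 ^ x ^ ((pixel >> b) & 1)     # 1 when the two bits agree
                flags[r][c] &= same
    return flags


def histogram(image, bits=6):
    """Count the pixels at each gray-scale value from 0 to 2**bits - 1."""
    return [sum(map(sum, match_plane(image, value, bits)))
            for value in range(1 << bits)]


if __name__ == "__main__":
    image = [[0, 5, 5], [63, 5, 0]]
    print(match_plane(image, 5))
    counts = histogram(image)
    print(counts[0], counts[5], counts[63])
```

A system interested only in a few broad bins would broadcast just those values, or compare only the upper bits, instead of sweeping all 64 levels.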
tracting, and deciding (Fig. 1). Each may be
processed by a dedicated hardware component, 
or two or more tasks can be processed by the 
same hardware. 

The improver block accepts raw data from 
the sensor, in many cases, a camera. Then it 
either restores the signal, correcting degrada- 
tions caused by the sensor, or enhances it, 
boosting the quality of the image to facilitate 
further processing. Signal restoration typically 
encompasses image deblurring, while enhance- 
ment includes edge enhancement or smoothing. 

The data is then passed along to the screening 
block, which removes the information not re- 
quired for the sophisticated algorithms that 
follow. Among this extraneous data is both 
noise and pixels below a given threshold. 

The extractor pulls out the primitive charac-
teristics that can be used by the decider to
recognize the object. And the final block, the de-
cider, looks at all the characteristics and attri-
butes gathered concerning the object and its
surroundings, compares them with its under- 
standing of the object's features, and decides 
whether it recognizes the object. 

The systolic array chip can be put to work in 
any of these blocks and is easily configured to 
process all of these sequential steps at the real- 
time rate of most video cameras (10 MHz). At 





present, they are handled by the hardware ar- 
chitecture deemed most appropriate for the 
task at hand. Both improving and screening are 
high-speed computations that involve a small 
number of dedicated algorithms, so they are 
generally handled by pipelined processors. Ex- 
tracting and deciding are usually taken care of 
by parallel microcomputers because of the wide 
variety of algorithms and different types of 
computations they involve. 


Process in parallel 


Another scheme, massively parallel pro-
cessing, dedicates one processor to every pixel
in an image, thus ensuring very high speeds. 
The systolic array goes with this approach; con- 
sequently, designers have that particular ar- 
chitecture available in relatively small 
systems. 

In practice, arrays can be grouped together 
until the number of processor elements match- 
es the number of pixels being processed. Fur- 
ther, each of the chip’s 72 processor elements 
has 128 bits of dedicated RAM, which gives it an 
additional level of flexibility. This internal 
memory also increases throughput by elimi- 
nating the time-consuming data fetches re- 
quired with typical von Neumann processors. 

The array’s single instruction, multiple-data 


1. After data is picked up by a sensor, the typical pattern recogni-
tion system processes it in four sequential tasks: improving,
screening, extracting, and deciding. Any of the four can be han-
dled by the Geometric Arithmetic Parallel Processor, set up as a
dedicated hardware component. Alternatively, two or more tasks
can be shared by the same hardware.
Systolic array chip 
recognizes visual patterns 
quicker than a wink 





Simultaneously processing a host of pixel values, 
a monolithic systolic array gives pattern recognition 
systems the get up and go to work in real time. 





The third article in a series dedicated to the first com- 
mercial systolic array processor focuses in on pattern 
recognition. The first (Oct. 31) introduced the chip and 
the second (Nov. 15) investigated its use in image ma-
nipulation. Forthcoming discussions will explore using
the chip as an associative memory unit and an asso- 
ciative processor, as well as data-base management. 


Pattern recognition is a simple concept
that lets automatic inspection and im-
age-processing systems emulate the cog-
nitive talents human beings take for granted.
Working with it, a system looks at an object,
determines what it is, often improving the
image sent to it in the process, and establishes
if the object under scrutiny meets specific
criteria.

A multitude of algorithmic approaches and
hardware architectures have been tried in the
attempt to let machines quickly convert data 
from sensors into information that can be used 
to decide on the next action. The Geometric 
Arithmetic Parallel Processor (GAPP), a CMOS
chip that carries 72 single-bit microprocessors 
that run in parallel, is suited to many of these 
schemes. 

Its multiple data paths make it right at home 
in a range of settings, allowing designers to 
turn to a single device instead of to a number of 
dedicated pieces of hardware. Further, specific 
algorithms can be assigned to various tasks, 
since the software for the array has more in 
common than do programs designed for unre- 
lated hardware. 

Despite the assortment of algorithms devoted to pattern
recognition, the approach can generally be broken down
into two categories: template matching and feature
matching. The first scheme is the most straightforward.
Using it, a system simply compares an incoming image to
those stored in memory until the matching pattern, or
template, is found.

Feature matching is more sophisticated and 
thus demands more processing power. In it, the 
system views the length, width, and other char- 
acteristics of an object to determine what it is 
without comparing it to a template. 

Either technique can be readily handled by the systolic
array. Both comprise four basic tasks, or blocks:
improving, screening, extracting, and deciding (Fig. 1).
Each may be processed by a dedicated hardware component,
or two or more tasks can be processed by the same
hardware.

The improver block accepts raw data from the sensor, in
many cases a camera. Then it either restores the signal,
correcting degradations caused by the sensor, or enhances
it, boosting the quality of the image to facilitate
further processing. Signal restoration typically
encompasses image deblurring, while enhancement includes
edge enhancement or smoothing.

The data is then passed along to the screening 
block, which removes the information not re- 
quired for the sophisticated algorithms that 
follow. Among this extraneous data is both 
noise and pixels below a given threshold. 

The extractor pulls out the primitive characteristics
that can be used by the decider to recognize the object.
And the final block, the decider, looks at all the
characteristics and attributes gathered concerning the
object and its surroundings, compares them with its
understanding of the object's features, and decides
whether it recognizes the object.

The systolic array chip can be put to work in any of
these blocks and is easily configured to process all of
these sequential steps at the real-time rate of most video
cameras (10 MHz). At present, the blocks are handled by
the hardware architecture deemed most appropriate for the
task at hand. Both improving and screening are high-speed
computations that involve a small number of dedicated
algorithms, so they are generally handled by pipelined
processors. Extracting and deciding are usually taken care
of by parallel microcomputers because of the wide variety
of algorithms and different types of computations they
involve.


Process in parallel 


Another scheme, massively parallel pro- 
cessing, dedicates one processor to every pixel 
in an image, thus ensuring very high speeds. 
The systolic array goes with this approach; con- 
sequently, designers have that particular ar- 
chitecture available in relatively small 
systems. 

In practice, arrays can be grouped together 
until the number of processor elements match- 
es the number of pixels being processed. Fur- 
ther, each of the chip’s 72 processor elements 
has 128 bits of dedicated RAM, which gives it an 
additional level of flexibility. This internal 
memory also increases throughput by elimi- 
nating the time-consuming data fetches re- 
quired with typical von Neumann processors. 

The array's single-instruction, multiple-data path is
particularly well suited to the aforementioned task of
improving, since the algorithms used for it demand pixel
processing using only local comparisons to determine the
value of a pixel. This so-called neighborhood processing
takes advantage of the structure of the chip, in which
each processor element communicates with its nearest
neighbors on the north, south, east, and west.

1. After data is picked up by a sensor, the typical
pattern recognition system processes it in four sequential
tasks: improving, screening, extracting, and deciding. Any
of the four can be handled by the Geometric Arithmetic
Parallel Processor, set up as a dedicated hardware
component. Alternatively, two or more tasks can be shared
by the same hardware.


Back to basics 


One of the most fundamental algorithms 
used in restoration consists mainly of adding 
successive frames of the same image (on a pixel- 
by-pixel basis) to yield a running average. 
Doing so improves the signal-to-noise ratio of 
the image, making it easier for the system to 
process and making it more visually pleasing to 
the operator overseeing the task on a display. 
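
The frame-averaging idea can be sketched in a few lines of Python. This is only an illustration on ordinary integers, not GAPP code; the function name `running_average` and the tiny 2-by-2 test frames are assumptions made for the example.

```python
def running_average(frames):
    """Average equally sized frames pixel by pixel (hypothetical helper)."""
    total = None
    count = 0
    for frame in frames:
        if total is None:
            # Start the accumulator with the shape of the first frame.
            total = [[0] * len(row) for row in frame]
        for r, row in enumerate(frame):
            for c, pixel in enumerate(row):
                total[r][c] += pixel
        count += 1
    return [[s / count for s in row] for row in total]


# Two noisy views of the same 2-by-2 scene.
noisy = [[[10, 12], [8, 9]],
         [[14, 8], [12, 11]]]
averaged = running_average(noisy)
```

On the array itself every pixel would be accumulated at once, one processor element per pixel; the nested loops here only stand in for that parallelism.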

The actual code used to add two 8-bit images 
(see Program 1) assumes that the first is stored 
in RAM locations 0 to 7 (with the MSB at the 
highest location) and that the second is stored 
in RAM locations 8 to 15 (the MSB here is held 
in location 15). It also takes for granted that 
both words are positive. Simple extensions, 
though, allow negative numbers to be added or 
subtracted. 

When the 25 instructions are finished, the two input
images remain in the same RAM locations, while the sum of
the images is stored in locations 16 to 24. The number of
instructions needed to add an m-bit number to an n-bit
number, where n ≥ m, can easily be determined with the
equation 3m + 2(n − m) + 1. When both numbers are 8 bits
wide, as in the above example, the result is
3(8) + 2(0) + 1 = 25.
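
A minimal Python model of one processor element makes the bit-serial addition and its instruction count concrete. It assumes the RAM layout described above (LSB at the lowest location) and assumes that each bit shared by both words costs three instructions, each extra bit of the longer word costs two, and the final carry costs one; the helper names are hypothetical.

```python
def full_adder(a, b, c):
    """Return (sum, carry) for three input bits, as a one-bit ALU would."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def add_bit_serial(ram, a_base, m, b_base, n, out_base):
    """Add the m-bit word at a_base to the n-bit word at b_base (n >= m).

    Bits are stored LSB first. The (n + 1)-bit sum is written at out_base
    and the assumed instruction count is returned.
    """
    carry = 0
    instructions = 0
    for bit in range(n):
        ns = ram[a_base + bit] if bit < m else 0   # shorter word has run out
        ew = ram[b_base + bit]
        ram[out_base + bit], carry = full_adder(ns, ew, carry)
        instructions += 3 if bit < m else 2
    ram[out_base + n] = carry                      # final carry is the top bit
    return instructions + 1


def store(ram, base, value, width):
    for bit in range(width):
        ram[base + bit] = (value >> bit) & 1


def load(ram, base, width):
    return sum(ram[base + bit] << bit for bit in range(width))


ram = [0] * 128                  # one processor element's 128 bits of RAM
store(ram, 0, 200, 8)            # first pixel in locations 0 to 7
store(ram, 8, 100, 8)            # second pixel in locations 8 to 15
count = add_bit_serial(ram, 0, 8, 8, 8, 16)
total = load(ram, 16, 9)         # nine-bit sum in locations 16 to 24
```

The same routine runs unchanged in every processor element, so the count is per image, not per pixel.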

Another common algorithm, this one used in image
enhancement, is a finite-impulse-response filter. The
equation

$$Y(n) = \sum_{i=0}^{N-1} a(i)\, I(n-i)$$

represents the output, Y(n), in terms of the input, I(n).
It consists of both adds and shifts.
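
As a rough sketch of that equation, the Python below evaluates Y(n) directly and builds each product from shifts and adds. The function names are hypothetical, coefficients are assumed to be non-negative integers, and samples before the start of the input are assumed to be zero.

```python
def shift_add_multiply(x, a):
    """Multiply x by a non-negative integer a using only shifts and adds."""
    result = 0
    shift = 0
    while a:
        if a & 1:
            result += x << shift
        a >>= 1
        shift += 1
    return result


def fir_filter(samples, coefficients):
    """Y(n) = sum over i of a(i) * I(n - i), with I(k) = 0 for k < 0."""
    output = []
    for n in range(len(samples)):
        acc = 0
        for i, a in enumerate(coefficients):
            if n - i >= 0:
                acc += shift_add_multiply(samples[n - i], a)
        output.append(acc)
    return output


y = fir_filter([1, 2, 3, 4], [1, 2, 1])
```

With coefficients that are powers of two, each product collapses to a single shift, which is presumably why the filter maps well onto a one-bit machine.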

The objects being observed are generally well defined in
contrast to the background, making it easy for the system
to pull out the characteristics needed to recognize a
pattern. When the data is received by the screening block,
however, one of its key tasks is to replace the weak
signals along the edges of the object with stronger
signals.

A common technique used for edge enhance- 
ment calls for a Sobel filter, a two-dimensional 
finite-impulse-response filter with a threshold- 
ing algorithm. The Sobel filter takes an existing 
image and creates a new one comprising the 
magnitude and direction of all the strong edges 
of the object. 

The filter works with neighborhood pro- 
cessing, determining the value of a pixel by ex- 
amining those adjacent to it. With a 3-by-3- 
pixel grouping, consisting of pixels A through I, 


Program 1. Double vision:
Adding two 8-bit images

NS:=RAM 0; C:=0        /* bit 0 of first image; clear carry */
EW:=RAM 8              /* bit 0 of second image */
RAM 16:=SM; C:=CY      /* store sum bit; keep carry */
NS:=RAM 1
EW:=RAM 9
RAM 17:=SM; C:=CY
NS:=RAM 2
EW:=RAM 10
RAM 18:=SM; C:=CY
NS:=RAM 3
EW:=RAM 11
RAM 19:=SM; C:=CY
NS:=RAM 4
EW:=RAM 12
RAM 20:=SM; C:=CY
NS:=RAM 5
EW:=RAM 13
RAM 21:=SM; C:=CY
NS:=RAM 6
EW:=RAM 14
RAM 22:=SM; C:=CY
NS:=RAM 7
EW:=RAM 15
RAM 23:=SM; C:=CY
RAM 24:=C              /* final carry is the ninth sum bit */





[Figure 2: (a) 3-by-3 pixel neighborhood, A through I;
(b) pairwise sums (A+B), (B+C); (D+E), (E+F); (G+H), (H+I);
(c) combined sums (A+B)+(B+C); (D+E)+(E+F); (G+H)+(H+I);
(d) resulting value.]

2. Edge enhancement using a Sobel filter furnishes a new
value for pixel E by comparing it with those (A through D
and F through I) that surround it (a). Pixel values are
added together in parallel (b, c) to yield the value (d).
All values are added in parallel.
Program 2. Sobel filter to
establish pixel value

Line  Code                   Line  Code

  1  EW:=RAM 0; C:=0          54  NS:=RAM 25
  2  EW:=E; NS:=RAM 8         55  NS:=S; C:=0
  3  RAM 16:=SM; C:=CY        56  NS:=RAM 25; EW:=NS
  4  EW:=RAM 1                57  NS:=N
  5  EW:=E; NS:=RAM 9         58  RAM 35:=SM; C:=BW
  6  RAM 17:=SM; C:=CY        59  NS:=RAM 26
  7  EW:=RAM 2                60  NS:=S
  8  EW:=E; NS:=RAM 10        61  NS:=RAM 26; EW:=NS
  9  RAM 18:=SM; C:=CY        62  NS:=N
 10  EW:=RAM 3                63  RAM 36:=SM; C:=BW
 11  EW:=E; NS:=RAM 11        64  NS:=RAM 27
 12  RAM 19:=SM; C:=CY        65  NS:=S
 13  EW:=RAM 4                66  NS:=RAM 27; EW:=NS
 14  EW:=E; NS:=RAM 12        67  NS:=N
 15  RAM 20:=SM; C:=CY        68  RAM 37:=SM; C:=BW
 16  EW:=RAM 5                69  NS:=RAM 28
 17  EW:=E; NS:=RAM 13        70  NS:=S
 18  RAM 21:=SM; C:=CY        71  NS:=RAM 28; EW:=NS
 19  EW:=RAM 6                72  NS:=N
 20  EW:=E; NS:=RAM 14        73  RAM 38:=SM; C:=BW
 21  RAM 22:=SM; C:=CY        74  NS:=RAM 29
 22  EW:=RAM 7                75  NS:=S
 23  EW:=E; NS:=RAM 15        76  NS:=RAM 29; EW:=NS
 24  RAM 23:=SM; C:=CY        77  NS:=N
 25  RAM 24:=C                78  RAM 39:=SM; C:=BW
 26  EW:=RAM 16; C:=0         79  NS:=RAM 30
 27  EW:=W; NS:=RAM 16        80  NS:=S
 28  RAM 25:=SM; C:=CY        81  NS:=RAM 30; EW:=NS
 29  EW:=RAM 17               82  NS:=N
 30  EW:=W; NS:=RAM 17        83  RAM 40:=SM; C:=BW
 31  RAM 26:=SM; C:=CY        84  NS:=RAM 31
 32  EW:=RAM 18               85  NS:=S
 33  EW:=W; NS:=RAM 18        86  NS:=RAM 31; EW:=NS
 34  RAM 27:=SM; C:=CY        87  NS:=N
 35  EW:=RAM 19               88  RAM 41:=SM; C:=BW
 36  EW:=W; NS:=RAM 19        89  NS:=RAM 32
 37  RAM 28:=SM; C:=CY        90  NS:=S
 38  EW:=RAM 20               91  NS:=RAM 32; EW:=NS
 39  EW:=W; NS:=RAM 20        92  NS:=N
 40  RAM 29:=SM; C:=CY        93  RAM 42:=SM; C:=BW
 41  EW:=RAM 21               94  NS:=RAM 33
 42  EW:=W; NS:=RAM 21        95  NS:=S
 43  RAM 30:=SM; C:=CY        96  NS:=RAM 33; EW:=NS
 44  EW:=RAM 22               97  NS:=N
 45  EW:=W; NS:=RAM 22        98  RAM 43:=SM; C:=BW
 46  RAM 31:=SM; C:=CY        99  NS:=RAM 34
 47  EW:=RAM 23              100  NS:=S
 48  EW:=W; NS:=RAM 23       101  NS:=RAM 34; EW:=NS
 49  RAM 32:=SM; C:=CY       102  NS:=N
 50  EW:=RAM 24              103  RAM 44:=SM; C:=BW
 51  EW:=W; NS:=RAM 24       104  NS:=0; EW:=0
 52  RAM 33:=SM; C:=CY       105  RAM 45:=SM
 53  RAM 34:=C






(Fig. 2), the equation that determines the Y-axis values
is

$$Y = (A + 2B + C) - (G + 2H + I) = [(A+B) + (B+C)] - [(G+H) + (H+I)]$$

The X-axis values are established by

$$X = (C + 2F + I) - (A + 2D + G) = [(C+F) + (F+I)] - [(A+D) + (D+G)]$$


The code for processing the first equation 
(see Program 2) reveals the unusual aspects in- 
volved in programming the systolic array. The 
true key to the chip’s speed lies in the simulta- 
neous computations it carries out. The first 25 
instructions add each pixel to its neighbor to 
the west to form a new image. 

Likewise, instructions 26 through 53 add this new image
to the eastern neighbors, forming a second image.
Instructions 54 to 105 add each pixel in that new image to
its neighbor to the north and finally add each pixel in
this third image to its southern neighbor. The resulting
data, which is determined for every pixel in the image,
can have as many as 10 bits, as well as a + or − sign. The
latter denotes whether the edge gradient is changing from
black to white or white to black. The fact that each of
these values is computed in parallel results in fast
throughput, since the value for the middle row is shifted
upward, where it becomes the bottom row of the 3-by-3-pixel
grid being processed in an adjacent processor element. The
value also is shifted downward to the adjacent processor
element, where it becomes the top row for the grid being
processed there, a 3-by-3-pixel group centered around H.
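
The pairwise-sum form of the X and Y equations can be checked with a short Python sketch. It works on whole images held as lists of lists, stands in for the neighbor shifts with a `shifted` helper, and assumes pixels beyond the image border read as zero; none of these names come from the chip's own instruction set.

```python
def shifted(image, dr, dc):
    """Pixel (r, c) of the result is image[r + dr][c + dc]; off-edge reads give 0."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] if 0 <= r + dr < rows and 0 <= c + dc < cols else 0
         for c in range(cols)]
        for r in range(rows)
    ]


def combine(a, b, sign=1):
    """Element-wise a + sign * b over two equally sized images."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sobel_y(image):
    pair = combine(image, shifted(image, 0, -1))   # each pixel plus its west neighbor
    row = combine(pair, shifted(pair, 0, 1))       # plus the pair to the east: W + 2P + E
    # Row sum from the north minus row sum from the south.
    return combine(shifted(row, -1, 0), shifted(row, 1, 0), sign=-1)


def sobel_x(image):
    pair = combine(image, shifted(image, -1, 0))   # each pixel plus its north neighbor
    col = combine(pair, shifted(pair, 1, 0))       # plus the pair to the south: N + 2P + S
    # Column sum from the east minus column sum from the west.
    return combine(shifted(col, 0, 1), shifted(col, 0, -1), sign=-1)


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
y = sobel_y(image)[1][1]   # (1 + 2*2 + 3) - (7 + 2*8 + 9)
x = sobel_x(image)[1][1]   # (3 + 2*6 + 9) - (1 + 2*4 + 7)
```

Every intermediate image is reused by the neighboring pixels, which is the sharing of partial sums that the text describes.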

The following Sobel operations are performed within
individual cells of the array. First, a reasonable
approximation of the magnitude of the gradient vector is
computed by adding the absolute values of X and Y. To
obtain that result, the sign bit plane for X determines
whether to invert it and add one to it to form the
absolute value. Processing the absolute value of a number
takes 3m+8 instructions, where m equals the number of
bits.
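
A small Python sketch of the invert-and-add-one step follows. It assumes two's-complement words, an 11-bit width (10 data bits plus sign) chosen purely for illustration, and hypothetical function names.

```python
def twos_complement_abs(value, bits):
    """Absolute value of a bits-wide two's-complement word via invert and add one."""
    mask = (1 << bits) - 1
    word = value & mask
    sign = word >> (bits - 1)
    if sign:
        word = ((word ^ mask) + 1) & mask   # invert every bit, then add one
    return word


def gradient_magnitude(x, y, bits=11):
    """Approximate |gradient| as |X| + |Y|."""
    return twos_complement_abs(x, bits) + twos_complement_abs(y, bits)


def abs_instruction_count(m):
    """Instruction count for the absolute value of an m-bit number."""
    return 3 * m + 8


magnitude = gradient_magnitude(8, -24)
```

The sum of absolute values overestimates the true Euclidean magnitude on diagonals, but it avoids squares and square roots entirely.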

Next, the direction of the gradient to the nearest 45°
line is determined. This must be done so that data can be
processed by the extractor block and is accomplished by
using the signs of both X and Y to determine which
quadrant contains the gradient. Once the direction is
established, another process, consisting of 18m + 76
instructions, is performed to bring the vector in line
with the nearest 45° angle.

In thresholding, the final step, the value is compared
with a predetermined constant. The result of this
operation is used to pass or reject the direction vector,
a step necessary to ascertain that it is valid data and
not simply noise. If the vector information is below the
threshold, the resulting word consists of 8 zeros. If it
is valid, the location of a single 1 in the 8-bit word
will denote which of eight directions the vector lies
closest to. Thresholding requires m+13 instructions, with
another 18 instructions needed to properly place the 1 in
valid words.
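
The direction and thresholding steps can be sketched together. The version below uses `math.atan2` for clarity instead of the sign-and-compare procedure the chip would run, and the numbering of the eight directions (0 for the positive X axis, counting counterclockwise in 45° steps) is an assumption, as are the function names.

```python
import math


def direction_octant(x, y):
    """Index (0 to 7) of the 45-degree line nearest the vector (x, y)."""
    angle = math.degrees(math.atan2(y, x))
    return round(angle / 45.0) % 8


def direction_word(x, y, threshold):
    """8-bit word: all zeros below threshold, else a single 1 marking the direction."""
    if abs(x) + abs(y) < threshold:
        return 0
    return 1 << direction_octant(x, y)


word = direction_word(8, -24, threshold=16)
```

A one-hot word costs more bits than a 3-bit code, but it lets later stages test for a given direction by examining a single bit plane.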


The last word


The resultant word is then passed to the extractor. At
this point, a spoke filter is called into play. It is used
to detect objects of various shapes and to extract length,
width, and other data from the image. For example, the
filter can be used to determine how many of the radial
spokes along eight axes have Sobel gradients that point to
or away from a specific pixel. The expected size and shape
of the object determines the lengths of the spoke's arms,
which can range from a single pixel to very large pixel
blocks.

The final block in a pattern recognition sys- 
tem handles the task of deciding. This is typi- 
cally carried out by a group of microprocessors 
or bit-slice processors that receive, store, and 
then manipulate the object characteristics that 
have been extracted in the preceding series of 
operations. These manipulations generally in- 
volve projecting a feature into a previously 
determined and segmented space to determine 
what type of object makes up the image. Al- 
though systolic arrays can perform this chore, 
it will probably remain the realm of standard 
processors. Algorithms are in plentiful supply 
that make good use of standard von Neumann 
architectures, which are well suited to the task. 

However, one of the most challenging aspects 
of implementing the decider hardware is link- 
ing it with the high-speed front end hardware. 
Often, this means adding several microcompu- 
ters, hence demonstrating the inherent net- 
working inefficiencies of these architectures. 


The best configuration for transmitting data 
from the systolic array blocks is to use a so- 
called corner-turning block, a small group of 
systolic arrays that reformats data for the 
microprocessor. The systolic array normally 
sends out data in bit planes instead of the 8- or 
16-bit words microprocessors work with so 
readily. The corner-turning block formats the 
bit planes into the gray scale values that make 
up a pixel. At the same time, the arrays can 
postprocess the pixels before passing the final 
decider data on to the microcomputer network. 


Flexibility is the key 


Each of the four tasks—improving, screen- 
ing, extracting, and deciding—can be handled 
easily by systems built around the systolic 
array chip. The IC can be incorporated into a 
number of setups, ranging from those that rec- 
ognize moving objects, through those that work 
with an assembly line robot, to those that rec- 
ognize characters. Indeed, the chip’s flexibility 
can be quickly grasped by these examples. Al- 
though all use roughly the same overall system 
architecture, the array can be programmed to 
meet the various algorithmic and speed de- 
mands called for by particular tasks. Thus a 
generic system can work in a variety of applica-
tions, giving it the flexibility associated with
traditional microprocessors (Fig. 3).

The system itself is centered on the main 
block of chips, but it also employs the array to 
take care of reformatting. These blocks will 
typically be used to format data for input as 
well as for output. 


On automatic pilot 


One example of this generic approach is an 
automatic inspection system that scans parts 
as they move along a production line. The 
system's data rate runs from 8 to 16 million 
words/s, with real-time response required. Im- 
proving and screening would require 1 or 2 algo- 
rithms; the extractor would execute from 10 to 
20 algorithms. The decider would perform 1 to 3
algorithms using the resultant data.

The major design tradeoff that must be con- 
sidered when building this or any other generic 
system is deciding how many processor ele- 
ments are needed and how they should be ar- 
ranged. That, in turn, hinges on the throughput 
rate and the number of instructions that must
be performed. 

For automatic inspection, as many as 3300 instructions
might be performed on each frame of data. Using a 10-MHz
chip, these can be processed in 330 µs. Since a standard
30-Hz frame is sent only every 33 ms, the chip would be
operating only 1% of the time if an individual processor
element were dedicated to each of the 512 by 512 pixels.
It is thus possible to trim costs by building a smaller
block of systolic array chips and passing the data through
the block several times.


On the tube 


To process 3300 instructions, the block would require
roughly 2623 processor elements, or about 37 chips. Since
data comes from the sensor in the standard TV-line format,
the most natural way to arrange the chips, which are
themselves laid out as 6 by 12 processor elements, is to
work with a grid of 6 by 516 elements. Thus the system
could handle six incoming rows of data at once because the
array's communication registers, which move data in and
out, could easily handle the 512 useful samples of data
coming from a 10-MHz TV camera.
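
The sizing arithmetic can be reproduced in a few lines of Python. The constant names are illustrative, and the frame period is rounded to 33 ms as in the text, so the element count lands within a few units of the rough figure quoted above.

```python
import math

CLOCK_HZ = 10_000_000            # 10-MHz chip
INSTRUCTIONS_PER_FRAME = 3300
FRAME_PERIOD_S = 0.033           # 30-Hz video, rounded to 33 ms
PIXELS = 512 * 512
CHIP_ROWS, CHIP_COLS = 6, 12     # 72 processor elements per chip

busy_s = INSTRUCTIONS_PER_FRAME / CLOCK_HZ       # time to run the program once
duty = busy_s / FRAME_PERIOD_S                   # fraction of each frame time in use
elements_needed = math.ceil(PIXELS * duty)       # elements if each is kept fully busy
chips_needed = math.ceil(elements_needed / (CHIP_ROWS * CHIP_COLS))
chips_in_strip = math.ceil(516 / CHIP_COLS)      # one row of chips, 6 by 516 elements
```

The single strip of chips is somewhat larger than the bare minimum, which leaves a little slack in the processing budget.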

However, if the spokes employed in feature 
extraction are 10 pixels long, the span of the 
spoke wheel is 20 pixels vertically. With only six 
cells aligned vertically, that causes severe 
complications in implementing the spoke algo- 
rithm. 


Three solutions 


There are three ways out of this problem. The 
simplest is to add two or three more rows of 
chips so the block is 18 to 24 elements deep. 
There will be some overlap problems, but the 
added processing power makes them fairly easy 
to solve. 

Alternatively, two frame buffers could be added before
the input reformatting block. That lets the system send
rectangular arrays of data into the main block of chips.
Overlap problems would be less severe, but this solution
drives cost up since more hardware is used. The tradeoff
between the two solutions comes down to the cost and time
needed to design in the requisite buffers, on the one
hand, and the added software needed to process the
spoke-filter algorithm when only six rows of cells are
used, on the other.

3. A generic system based on the systolic array chip can
be used in many applications. Systolic arrays are not only
the main system building block, but are employed to
reformat data going into and coming out of the main array.
The last performs the processing needed to determine the
characteristics of the object under scrutiny.


The third time’s the charm 


The ultimate solution goes with the 6-by-516-cell layout,
but stores three or four sets of six video lines in the
cells using each element's 128 bits of RAM. Four sets of
six rows would occupy only 32 RAM locations when 8-bit
words are used. If this scheme leaves enough RAM to
perform the necessary computations, it is an attractive
solution, since 24 video lines can be processed
simultaneously with a single row of chips.
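
The RAM budget for this scheme is easy to tabulate. In the sketch below each processor element holds one pixel from each stored set of lines, so four sets of 8-bit pixels take 4 × 8 = 32 locations; the function name is hypothetical.

```python
RAM_BITS_PER_ELEMENT = 128
ROWS_PER_SET = 6                 # the strip is six processor elements tall


def ram_budget(line_sets, bits_per_pixel, ram_bits=RAM_BITS_PER_ELEMENT):
    """Return (bits used for stored pixels, bits left for computation)."""
    used = line_sets * bits_per_pixel
    if used > ram_bits:
        raise ValueError("stored lines do not fit in the element's RAM")
    return used, ram_bits - used


used, free = ram_budget(line_sets=4, bits_per_pixel=8)
video_lines = 4 * ROWS_PER_SET
```

Whether the remaining bits suffice depends on the working storage the chosen algorithms need, which is the condition the paragraph above attaches to this solution.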

Other systems can be configured using the 
same basic approach. Varying data input rates 
and the type of data being processed, though, 
will force minor changes in the actual arrange- 
ment of the blocks. One variation on the basic 
system enables it to be put to work enhancing 
images in computer-aided tomography. Im- 
proving the quality of the images from a CAT 
scanner has the obvious advantage of giving 
the physician a better chance to locate and 
identify abnormalities. 

Since input from the scanner comes in large 
blocks of data—every few minutes or seconds— 
the system also requires an input buffer. It 
must have a high-speed input port, although it 
can send data to the block of arrays at a slower 
rate. 

The variety of algorithms needed for the many chores of
a physician mandates a fairly large memory buffer for the
controller. The main block of processor elements should be
rectangular; the total number of cells will again depend
on the algorithmic load and the frequency of input. It
might also be a good idea for the system to have an
interactive interface so that the doctor can refine the
image while it is being viewed.
Associative memory
calls on the talents
of systolic array chip





A monolithic systolic array puts its on-chip 
memory to good use, first searching out the desired 
information and then processing it. 





This is the fourth in a series of articles dedicated to the
first commercial systolic array processor chip. The
series began with the cover article in the Oct. 31 issue
(p. 207) and has continued in every consecutive issue.


How to retrieve data from memory is a classic problem
for designers. Theoretically, one of the simplest ways to
get information is to match the memory contents with the
desired key, much as an instructor calls upon pupils to
volunteer an answer to a question. However, processors
traditionally force programmers to call memory using
addresses. That scheme often makes for relatively slow
processing. Searching through a set of numbers for the one
with the highest value, for instance, typically forces a
von Neumann processor to examine each member in the set at
least once.

An alternative is an associative memory system, which
matches some part of the desired data within memory
instead of requiring addresses. Known by a host of names,
including content-addressable memory, data-addressed
memory, parallel search memory, and search associative
memory, the technique has been called into service a
number of times in various applications. But such systems
are usually fairly large and expensive.

The Geometric Arithmetic Parallel Pro- 
cessor, or GAPP, takes a new approach to this 
long-standing dilemma. The first systolic array 
processor chip, the device is not only well suited 
to associative memory but can be configured in 
a comparably small system as well. Further, 
since it performs logical and arithmetic oper- 
ations, it also serves as an associative processor 
that works on the data found in a memory 
search. 

The chip carries 72 single-bit processors that run in
parallel as a single-instruction, multiple-data path
system. Each processor element is fitted out with 128 bits
of dedicated RAM. Also vital to associative memory is the
IC's global broadcast function, which lets users transmit
data, such as the search word, to all of the processor
elements simultaneously.

The tremendous speed advantage of parallel processing
over conventional methods is demonstrated in the relative
time needed to perform associative memory searches.
Searching for a single number in a set with N members, a
traditional processor running an algorithm that lets it
look just once at each number takes up to N cycles to
locate the desired data. The associative processor, in
contrast, needs M cycles, where M equals the number of
bits in the target information. The device interrogates
the memory entries in parallel, and once the proper
information has been found, it can be either processed by
the associated processor or passed directly to the host.

An associative memory unit can be broken down into two
main components: an associative array and an associative
array controller (Fig. 1). The arrangement is similar to
traditional memories, which consist of an array of
addressable cells and a controller.


Other functions 


In addition to managing the memory, the 
controller also handles sequencing. It contains 
two registers: one holds the data that the host is 
looking for, called the “comparand”; the other 
contains a mask, which screens out unwanted 
bits during processing. 

The associative array comprises a group of cells, each
of which generally holds a single word. Each cell has a
tag bit to notify the controller when its data matches the
sought-after word. When there is a match, the tag bit is
set and that cell is termed a responder. If the desired
data is not in that cell, the tag bit remains unchanged
and the cell is dubbed a nonresponder.

[Figure 1 labels: associative array controller;
target-word, or comparand, register; associative array.]

1. An associative memory consists of a controller and an
array of memory cells. The first section manages the
second and stores the target word, or comparand, and the
mask, both of which come from the host. The tag bits
attached to each cell denote what data is located at any
one.

Every cell within an associative array also performs
three tasks: compare, write, and read. The first and most
fundamental task simply compares the masked target data
with the contents of the cells, setting the tag bit if
there is a match. The second then writes a data word to
the responding cells. Read, the final function, shifts the
contents of the responding cells to the output bus. If
there is more than one responder, the output is the
bit-by-bit logical OR of all the responding cells. These
three functions are carried out only on data in the
responding cells; nonresponding cells are untouched.
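
A word-level Python model shows how the three functions fit together. It is only a behavioral sketch: the class name is hypothetical, a 1 in the mask is assumed to mean "compare this bit," and the loops stand in for work that every cell does simultaneously.

```python
class AssociativeArray:
    """Behavioral model of cells with one tag bit each (hypothetical class)."""

    def __init__(self, words):
        self.words = list(words)
        self.tags = [False] * len(self.words)

    def compare(self, comparand, mask):
        """Set the tag of every cell whose masked contents match the comparand."""
        for i, word in enumerate(self.words):
            if (word & mask) == (comparand & mask):
                self.tags[i] = True      # tags of nonmatching cells are left unchanged
        return any(self.tags)            # responder signal for the controller

    def write(self, value):
        """Write value into the responding cells only."""
        for i, tagged in enumerate(self.tags):
            if tagged:
                self.words[i] = value

    def read(self):
        """Bit-by-bit OR of all responding cells."""
        out = 0
        for word, tagged in zip(self.words, self.tags):
            if tagged:
                out |= word
        return out


memory = AssociativeArray([0x12, 0x34, 0x16, 0x78])
found = memory.compare(comparand=0x10, mask=0xF0)   # match on the upper four bits
merged = memory.read()
memory.write(0xFF)
```

Because a read ORs all responders together, a controller that wants individual words would have to narrow the responder set to one cell first.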

This architecture can easily implement the 
associative memory and handle associative 
processing tasks. The feature set of the systolic 
array chip performs all the operations needed 
to form an associative array, and when coupled 
with a programmable control circuit, it also 
serves as an associative processor. 


Maintaining control 


The GAPP chips themselves form the asso- 
ciative array. However, since these devices do 
not have any control features, an external con- 
troller must be used to sequence the array and 
to oversee the target-word and the mask reg- 
ister. It consists of a control store, address se- 
quencer, and host computer (Fig. 2). 

The associative array can be built in various 
sizes, since the chips can be ganged together to 
create larger arrays. Because each processor 
element within a grid of chips is a serial pro- 
cessor, word widths greater than one bit must 
be emulated serially. The on-chip RAM lets the 
elements handle a variety of word widths, as 
well as multiple words and multiple tag bits 
since every RAM location is individually ad- 
dressable. Further, there will be some RAM left 
over that can be used as a scratchpad or for 
storing arithmetic operands and other data. 

Input and output for the full array are handled via the
communications (CM) bus. Input usually comes from
traditional processors in word-serial and bit-parallel
form. However, because processor elements accept only
word-parallel and bit-serial data, incoming information
must be reformatted. This job, often referred to as corner
turning, can be implemented with GAPP devices or with
special parallel-to-serial circuitry.


Doubling up


The north/south (NS) register of the chip pulls double
duty, acting both as a place to store the tag bit and as
an area to carry out various functions. The tags of all
the processor elements can be quickly sent to the
controller using the global output signal (see first
article in series, ELECTRONIC DESIGN, Oct. 31, p. 207, for
definition of global output). Several GAPP chips can then
be combined in a wired-OR configuration to generate a
responder signal for the control unit.

The control section can be set up in a number of ways,
but the most effective is to use a programmable
controller. Whatever route is taken, though, the
controller remains responsible for generating instructions
and addresses for the array and for sampling the output of
the responder detection circuit. The control store
receives its list of instructions from the host.
Bit-serial data can quickly be sent to all processor
elements using the global input, which is easily
accomplished by employing the op code lines to command the
C register to load either a 1 or a 0.

Programming the associative processor is different from
programming a conventional processor with RAM. That is due
both to the different type of search involved and to the
array's architecture. In operation, the instructions
supplied by the control unit to the associative array are
mnemonics for the chip. The control-


[Figure 2 labels: responder (global output); grid of
48 x 48 associative processor cells (32 GAPP chips); mask
register; target-word register; address sequencer;
4k x 24-bit control store; CMN; line buffer array of
12 x 48 processor elements (8 GAPP chips); CMS; bus to
host.]

2. An associative memory controller (highlighted)
consists of a control store, which holds commands coming
from the host, and a sequencer. The sequencer can use data
from responders to branch to another part of a program.
ease of using primitives can be seen in the exact match
function, which differs from the compare primitive only in
that the latter emulates a aoe machine by operating on a
word bit by bit.

As in the foregoing example, the responders are searched
for an exact match to the masked target-word register,
with matches indicated by setting the tag bit.
Specifically, a response bit is maintained at location R
in the RAM (Program 2). If R is set, that particular cell
contains a possible match to the broadcast data. If R is
clear, that cell does not contain a possible match. At the
end of the algorithm, all cells marked as possible matches
are determined to be exact matches for the full word. This
approach takes 5.5M+2 cycles, with M again representing
the number of bits being compared. (This rate assumes that
the number of 1s and 0s is equal.)
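
The bit-by-bit narrowing of possible matches can be sketched in Python. The list `r` plays the role of the response bit R in each cell; the mask polarity (1 means "compare this bit"), the MSB-first order, and the function names are assumptions for the example.

```python
def exact_match(words, target, mask, bits):
    """Return one response bit per cell after comparing every unmasked bit."""
    r = [1] * len(words)                    # every cell starts as a possible match
    for bit in range(bits - 1, -1, -1):
        if not (mask >> bit) & 1:
            continue                        # masked-out bits are not compared
        t = (target >> bit) & 1             # bit broadcast to all cells
        # XNOR of the stored bit and the broadcast bit, ANDed into R.
        r = [ri & (1 ^ ((w >> bit) & 1) ^ t) for ri, w in zip(r, words)]
    return r


def exact_match_cycles(m):
    """Average cycle count quoted for an M-bit exact match."""
    return 5.5 * m + 2


cells = [0x12, 0x34, 0x16, 0x12]
full = exact_match(cells, 0x12, 0xFF, 8)
upper_nibble = exact_match(cells, 0x12, 0xF0, 8)
```

The cost grows with the word width M and not with the number of cells, since each pass of the loop touches every cell at once.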

The write primitive is also handled with a bit-wide
scheme. It, too, can be repeated for larger words. In this
algorithm, the tag bit is restored to the NS register near
the end of the program to ensure that multiple invocations
will proceed smoothly. The restoration is combined with an
OR operation, so no added cycles are demanded.


Program 3. A writing lesson


/* Write function */
/* Load contents of addr into ew */
ew:=ram(addr), c:=0;
/* Produce AND of not(tag), assumed to be in NS, with
   contents of addr */
c:=bw;
ram(temp):=c, c:=1;
/* Load value into ew, tag assumed to be in NS */
if (value == 0)
    ew:=0, c:=0;
else
    ew:=c, c:=0;
/* AND value and tag */
c:=cy;
/* Load intermediate values in anticipation of OR */
ns:=c, ew:=ram(temp), c:=1;
/* Perform OR and restore tag */
c:=cy, ns:=ram(tag);
ram(addr):=c;


First, load the cells 


The write algorithm begins by placing a Bool- 
ean value into each of the responder cells (Pro- 
gram 3). Since the chip treats all locations 
equally, some care must be taken to write only 
to the responders. To be sure that the non- 
responding cells are not modified, the tag bit for 
each responder must first be inverted and then 
ANDed with the data in RAM. 

Once this is done, the tag of the original responders is
ANDed with the new value. The results are then ORed
together; this effectively blocks the nonresponders from
accepting the new value. The data is then written into the
addr of the responders.
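
The masking logic reduces to one Boolean expression per cell: (NOT tag AND old) OR (tag AND value). The Python below applies it one bit position at a time and repeats it for wider words; the function names are hypothetical and the lists stand in for the parallel cells.

```python
def masked_write_bit(old_bits, tags, value):
    """One bit position across all cells: (NOT tag AND old) OR (tag AND value)."""
    return [((1 - tag) & old) | (tag & value) for old, tag in zip(old_bits, tags)]


def masked_write_word(words, tags, value, bits):
    """Repeat the bit-wide write for each bit of value."""
    result = [0] * len(words)
    for bit in range(bits):
        plane = [(w >> bit) & 1 for w in words]              # current bit of every cell
        new_plane = masked_write_bit(plane, tags, (value >> bit) & 1)
        result = [r | (b << bit) for r, b in zip(result, new_plane)]
    return result


words = [0x12, 0x34, 0x16, 0x78]
tags = [1, 0, 1, 0]                                          # cells 0 and 2 are responders
updated = masked_write_word(words, tags, 0xA5, 8)
```

Since every cell executes the same instructions, the tag acts as a data-dependent write enable instead of a branch.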

As mentioned, these primitives can be used 
as building blocks for more complex tasks. A 
limit search, for example, finds the set of re- 
sponders with a value greater than, greater 
than or equal to, less than, or less than or equal 
to a particular number. All of this can be done 
with just the compare and write algorithms. 
Basically, all the responders are designated as 
greater than, equal to, or less than the desired 
value. Once that is done, it is simple to select 
any combination of sets. 


Call and response 


The three response bits, X, Y, and Z, all are held in RAM
(Program 4). The first either indicates equality or denotes an
undecided state that requires further processing. The second
designates greater than, and Z signifies less than. Initially,
all responding cells are set to the undecided state, or 100. If
the MSB of the target number is a 1, then all the undecided
elements that have a 0 as their MSBs will be marked less than.
If the target’s MSB is a 0, all the elements that show a 1 as
their MSBs will be marked greater than. This sequence continues
through the full number of bits making up the target, and at the
end of the word, those cells that are still undecided are
determined to be equal to the desired number. The limit search
takes 24+29M cycles.
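
A minimal Python sketch of the limit search follows. It models the undecided, greater-than, and less-than states with strings rather than the X, Y, and Z bits, and it numbers bit m-1 as the MSB (the article numbers the MSB as bit 0); both choices are assumptions made for readability. The inner loop runs serially here, whereas the array would update all cells at once.

```python
def limit_search(words, target, m):
    """Classify each m-bit word as 'eq', 'gt', or 'lt' relative to target."""
    state = ["undecided"] * len(words)
    for i in range(m - 1, -1, -1):          # scan from the MSB downward
        t = (target >> i) & 1
        for k, w in enumerate(words):
            if state[k] != "undecided":
                continue                    # already decided at a higher bit
            b = (w >> i) & 1
            if t == 1 and b == 0:
                state[k] = "lt"
            elif t == 0 and b == 1:
                state[k] = "gt"
    return ["eq" if s == "undecided" else s for s in state]

print(limit_search([5, 9, 12, 9], 9, 4))  # ['lt', 'eq', 'gt', 'eq']
```

Selecting "greater than or equal to" is then a matter of taking the union of the 'gt' and 'eq' sets.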

Once a responder or class of responders is 
ler language in the algorithms used by the chip employs a syntax
similar to that of the C language. References to the bits in the
mask and target-word registers take the form of a bit number.
The register’s MSB is assigned a 0, and the LSB is designated
M-1, where M equals the word length. The RAM associated with
each processor element is similarly labeled, with the

Program 1. Making a comparison

/* Compare function */
/* Load the NS and EW registers */
if (value == 0)
    ew:=0, ns:=ram(addr), c:=1;
else {
    c:=1;
    ew:=c, ns:=ram(addr);
}
/* EXNOR into NS reg */
ns:=ram(temp), ram(temp):=sm;
/* AND result with TAG */
ew:=ram(tag), c:=0, ns:=ns;
c:=cy;
/* place results in RAM and NS */
ram(tag):=c, ns:=c;


Program 2. The match game

/* TAG exists in NS */
/* load tag into R */
c:=0, ew:=0;
ram(R):=sm;
/* Loop for every bit in the word */
for (i=0; i<M; i++) {
    if (mask(i) == 1) then {
        /* Load ew and ns with the values to be compared */
        if (comparand(i) == 0) then
            ew:=0, ns:=ram(cell+i), c:=0;
        else {
            c:=1;
            ew:=c, ns:=ram(cell+i), c:=0;
        }
        /* EXOR operands */
        ns:=ram(temp), ram(temp):=sm;
        /* Update R bit and tag */
        ew:=ram(R), c:=0;
        c:=cy;
        ram(R):=c, ns:=c;
    }
}








contents of the data word appearing in locations
“Cell” through “Cell+M-1”. Response
bits are stored in additional RAM locations and 
are typically used by the algorithms. 


Time for our program 


Since, as noted, a system consisting of GAPP 
chips is bit-serial, the most effective primitives 
are 1 bit wide. The bit-wide comparison prim- 
itive (Program 1) loads the comparand bit into 
all of the processor elements using one instruc- 
tion, an important technique for parallel data 
input. This is accomplished by using the set 0 in- 
struction of the East/West (EW) registers and 
the set 1 instruction of the C register. 

The compare algorithm also makes use of a time-saving scheme
concerning the tag bit. Since the NS register, which normally
retains the tag bit, is needed for many other tasks, the bit is
simultaneously written to a RAM location reserved for the tag
(last line of Program 1). At the completion of any function that
reports results by marking the responders using the tag, the tag
bit also appears in the NS register.

In each of the responding cells, the compare algorithm sets the
tag bit if the value residing at a given RAM address (addr)
matches the value sent from the host. This address can be any
number between 0 and 127.

The algorithm starts off by first loading the NS register with
the value stored at addr. At the same time, the EW register is
loaded with the value sent from the host. Next, the EW and NS
registers are exclusive-NORed, and the result is placed in the
NS register, where it is ANDed with the tag bit. The result of
the last operation is placed in both the tag location and in the
NS register. Although this describes a 1-bit search, it could
easily be invoked repeatedly to operate with larger words.
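
The tag update performed by the compare primitive can be sketched in Python. The list-per-cell model and the function names are assumptions; the sketch shows only the logic (tag AND XNOR of stored bit and broadcast bit), not the register transfers.

```python
def compare_bit(ram_bits, tag, value):
    """Return the new tag: old tag AND XNOR(stored bit, broadcast value)."""
    return [t & (1 - (d ^ value)) for d, t in zip(ram_bits, tag)]

def compare_word(words, tag, comparand, m):
    """Invoke the 1-bit compare once per bit to match m-bit words."""
    for i in range(m):
        bits = [(w >> i) & 1 for w in words]
        tag = compare_bit(bits, tag, (comparand >> i) & 1)
    return tag

# Cell 2 holds a match but was not a responder, so its tag stays clear.
print(compare_word([6, 3, 6, 7], [1, 1, 0, 1], 6, 3))  # [1, 0, 0, 0]
```

Because each pass can only clear tag bits, the order in which the bits are compared does not affect the final set of responders.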


Five or six cycles 


Only five cycles are needed to compare a 0, six to compare a 1.
In all of the examples, the assumption is made that the
controller can keep the associative array operating at maximum
speed at all times, so that no wait stages are required.

Primitives such as compare increase the pow- 
er available for larger programming tasks and 
also simplify the programmer’s chores. The 
third read does just that, employing the responder signal
supplied by the associative array to read the value of the
responding cells. Doing so places the data in the responding
cells at addr in the NS register. The routine:


/* Load ns with TAG */
ns:=ram(tag);
/* Load ew with data */
ew:=ram(addr), c:=0;
/* AND tag and data */
c:=cy;
/* Place results in NS */
ns:=c;


makes good use of the fact that the tag bit is stored in memory,
so it can be used consecutively a number of times. The read is
called up by logically ANDing the tag at ram(tag) with the data
at addr and then placing the results in the NS register.

A read of this type would be desirable whenever it is necessary
to address portions of the array individually at a unique
address, as with a conventional memory, instead of by the
content of memory. This is done by loading the data words over
the CM bus so that each element holds a unique number in its
RAM. An exact match search (using either primitives or the
larger program) could then be executed and the cell contents
read. On a grid of systolic arrays 516 elements on a side, the
exact match would take a maximum of 114 cycles and the read
would take 4 × M cycles.


Easing into associative processing 


Once an associative memory has been built 
from GAPP chips, it is relatively simple to en- 
hance it to create an associative processor. The 
primitives and techniques detailed permit a full 
range of searches, which can be combined with 
arithmetic or logical operations to further in- 
crease the system’s capabilities. 

As simple as it sounds, one of the functions most useful to
associative memories is counting, or simply summing the number
of responders. Its importance can be seen in a limit search, in
which it might be vital to know how many bits fall within a set
of limits. Doing so relies on the chip’s ability to perform
serial arithmetic, keeping track of the number of cells that
have the tag bits set.


A related function, called first, lets users pro- 
cess the bits on an individual basis. Once it is 
determined that there are multiple responders, 
it becomes important to turn off all but the de- 
sired one, so that the others are not altered 
when the operation takes place. 

One technique is to propagate a marker in the 
northwest corner cell of the array and carry it 
eastward along the top row until the eastern 
edge of the array is reached. It can then drop 
down a row and move in the opposite direction, 
continuing with its serpentine route across the 
full array. Since the marker resides in only one 
cell at a time, only that cell is on, so it is the only 
one being addressed at any given time. Sampl- 
ing the responder after each shift of the marker 
makes it possible to know when the first re- 
sponder has been reached. Another technique 
would be to prioritize multiple responders with 
the unique PE address. 
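
The serpentine marker technique can be sketched in Python as follows. The grid-of-lists representation and the returned shift count are illustrative assumptions; on the array the marker would be shifted between neighboring cells rather than indexed.

```python
def first_responder(tag_grid):
    """Return (row, col, shifts) of the first tagged cell on a serpentine path.

    The marker starts in the northwest corner, moves east along even rows
    and west along odd rows. Returns None if no cell is tagged.
    """
    shifts = 0
    for r, row in enumerate(tag_grid):
        cols = range(len(row)) if r % 2 == 0 else range(len(row) - 1, -1, -1)
        for c in cols:
            if row[c]:            # responder sampled at the marker position
                return r, c, shifts
            shifts += 1
    return None

grid = [[0, 0, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(first_responder(grid))  # (1, 2, 3)
```

Note that on the second row the marker meets the eastern responder first, because it is travelling westward on that row.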

A final algorithm takes advantage of some of 
the processor element’s logical functions to cal- 
culate the number of bit positions by which the 
corresponding digits in each cell differ from 
those of the desired value. This number, or 
Hamming distance, forms the basis of numer- 
ous error correction codes. The algorithm: 


/* Store zero in COUNT field of all candidates */
for (i=0; i<M; i++)
    WRITE(COUNT + i, 0);
/* Compare all bits */
for (i=0; i<M; i++) {
    SET;
    COMPARE(CELL + i, COMPARAND(i));
    ADDONE;
}

is effectively a search for no match, which is fol- 
lowed by incrementing a count field in all re- 
sponders. Once the Hamming distance is deter- 
mined for each word in the associative array, a 
least-value search can be performed on the 
count field to find the word with the smallest 
distance. Every time no match is found, the
counter is incremented by one.
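
A minimal Python sketch of the bit-serial Hamming distance follows, assuming each cell's word is held as an integer and the count field as a plain counter; the function name is hypothetical. The final two lines stand in for the least-value search on the count field.

```python
def hamming_distances(words, comparand, m):
    """Bit-serial Hamming distance of every stored word from the comparand."""
    count = [0] * len(words)                  # COUNT field, cleared first
    for i in range(m):
        c = (comparand >> i) & 1
        for k, w in enumerate(words):
            if ((w >> i) & 1) != c:           # cell responds on "no match"
                count[k] += 1                 # ADDONE in the responders
    return count

dist = hamming_distances([0b1011, 0b0000, 0b1111], 0b1010, 4)
print(dist)                   # [1, 2, 2]
print(dist.index(min(dist)))  # 0
```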
identified, it is often important to interrogate
the array to read its values. The most obvious 
way to accomplish this is to shift the contents of 
each cell out via the CM bus. The data words 
would have to be prefaced by the tag bit, so that 
the external circuitry can determine which of 
the words are responders. Obviously, this 
approach operates on both responders and non- 
responders, but this will not matter in many ap- 
plications. 

Nevertheless, there are two drawbacks to 
this method. First, if a large array is chosen to 
increase throughput, the time needed to shift 
all the data out can be significant (see the table 
below). Second, the requisite external circuitry 


Program 4. In search of limits

/* Mark all elements as undecided */
WRITE(X, 1);
WRITE(Y, 0);
WRITE(Z, 0);
/* Loop through all bits */
for (i=0; i<M; i++) {
    /* Reset Tag */
    ns:=0;
    if (comparand(i) == 0) {
        /* locate all undecided cells */
        COMPARE(X, 1);
        /* locate greater than cells */
        COMPARE(cell+i, 1);
        /* mark responders as greater than */
        WRITE(X, 0);
        WRITE(Y, 1);
    }
    else {
        /* locate all undecided cells */
        COMPARE(X, 1);
        /* locate less than cells */
        COMPARE(cell+i, 0);
        /* mark responders as less than */
        WRITE(X, 0);
        WRITE(Z, 1);
    }
}

Loading or reading time for a single-bit plane

Dimensions of array      Total number of       Total number of
(processor elements)     processor elements    GAPP chips
48 × 48                  2,304
132 × 132                17,424
516 × 516                266,256
1032 × 1032              1,065,024



inherent to this scheme is substantial. How- 
ever, in tasks like image processing, where large 
portions of the data are of interest, this still 
might be an effective technique. 


A question of comparison 


A second approach simply employs the com- 
pare instruction to read the value of a bit using 
the responder signal. Comparing addr with a 0 
effectively shifts the data in along with the re- 
sponder signal, where it can then be shifted into 
the target-word register. A compare between 
the addr and 1 shifts the inverted data in with 
the responder signal. If there is more than one 
responder, the OR of all the data bits is posi- 
tioned on the responder signal. 

This method can be called into play when it is 
necessary, for instance, to establish the num- 
bers of pixels that are at their maximum values. 
The algorithm for determining which of the 
words in memory has the highest value uses the 
responders first as decision-making data and 
then to read the maximum value. The first two
steps of the following algorithm effectively
amount to an instruction that sets all the tag
bits:

for (i=0; i<M; i++) {
    /* search for undecided cells */
    SET;
    COMPARE(R, 1);
    COMPARE((CELL + i), 1);
    if (RESPONDER)
        WRITEN(R, 0);
}


The program first scans all the memory cells, looking at the
most significant bits. If any cell has a 1 in that position, it
is noted that it holds the highest value. All the cells with 0s
are marked as storing less than highest. When the bits are all
processed, the cells containing the highest value at each bit
are determined to contain the maximum word value. This will
take 14 cycles for each bit of the data.
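
The maximum search can be sketched in Python. Here the list r plays the role of the R bit (1 means the cell is still a candidate) and any() stands in for the global responder signal; both are modelling assumptions, and bit m-1 is taken as the MSB.

```python
def max_search(words, m):
    """Return (candidate flags, maximum value) by scanning bits MSB first."""
    r = [1] * len(words)              # every cell starts as a candidate
    maximum = 0
    for i in range(m - 1, -1, -1):
        hits = [r[k] & ((w >> i) & 1) for k, w in enumerate(words)]
        if any(hits):                 # responder: some candidate has a 1
            r = hits                  # candidates showing a 0 drop out
            maximum |= 1 << i         # the responder bit is read as data
    return r, maximum

print(max_search([5, 12, 7, 12], 4))  # ([0, 1, 0, 1], 12)
```

The responder signal serves two purposes in each pass: it decides whether to eliminate cells, and it supplies the corresponding bit of the maximum value.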


Cutting down on overhead 


The program uses the compare algorithm 
whether or not the data is to be read out. It is 
possible, though, to read the contents of a cell 
without incurring the overhead associated with 
the exclusive-OR used in a comparison. The 
tially processing each bit; that is, it is a word-parallel,
bit-serial processor.

A basic relational data-base scheme can be implemented with the
systolic array, but since its architecture is not similar to
that of von Neumann processors, the architecture of the
relational data-base system will also differ. The systolic
array processor can be used in a subsystem or as part of
another system.


An accent on speed 


The chip’s single-instruction, multiple data- 
path (SIMD) architecture increases through- 
put in common tasks within the relational data 
base. The chip, arranged in rows and columns of 
processing elements, handles rows and columns 
of data. 


For instance, in a relational Join operation (Fig. 1), two
tables (a) are linked to create a third table (b). The array
forms each row in the result by joining two rows, one from each
of the tables. The selected rows each have a common element. A
Semi-Join operation produces an output from only one of the
tables, but the items selected for output depend upon the
second table (c).

In a typical processing system, each of these steps must be
handled sequentially. However, the systolic array can work with
all the rows and tables at the same time, producing the Join
tables much faster. Each processor element in the array works
on an individual datum (item of data). The array can be used
with various data base formats; a number of chips can be tied
together to make a large grid of processor elements, thereby
increasing throughput.

Since a parallel processor handles data faster 
than its traditional counterpart, its I/O rates 
must correspondingly be faster. Mainframe 
parallel processors have used three major ap- 
proaches, and all can be used with systems built 
with the systolic array. The earliest technique 
dedicated a head for each track of the fixed disk 
that the data-base management system used 


1. The systolic array chip can easily handle such relational data-base
operations as Join and Semi-Join. A Join operation, for instance, links
files 1 and 2 (a) to form file 3 (b). The Semi-Join operation then
searches the newly formed table, pulling out which students are enrolled
in a mathematics class (c).
Systolic arrays fill the bill
as data-base management
heads for gigabyte range





Parallel-processing building blocks with distributed 
memories offer speed and ease of use in systems 
where von Neumann architectures would falter. 





This is the last in a five-part series on the first commercial
systolic array processor chip. The initial article was the cover
story of the Oct. 31 issue, and with the exception of Dec. 27,
an installment has appeared in every succeeding issue.


Data-base management has become increasingly
important in recent years,
especially for relational data bases.
However, as the size of these data bases moves 
toward the gigabyte range, conventional von 
Neumann architectures are too slow to meet de- 
mands. In large part, this is because the basics 
of data-base management — storing, retrieving,
searching, updating, deleting, merging and or- 
dering data—are not numerical operations. 
System designers must spend considerable 
time translating the basic data-base commands 
into host instruction sets for use on conven- 
tional processors. 

Compounding this problem is a conflict be- 
tween the requirements of operating systems 
and data-base managers. Ideally, a data-base 
system should store indexes in locations that 
contain only the information the indexes refer- 
ence. However, operating systems distribute 
data to make the best use of available storage. 
Moreover, there is a tendency in von Neumann 
virtual memory systems to swap out pages of 
data frequently used by the data base. The big- 
gest bottleneck of the von Neumann architec- 
ture is that all data processing is sequential. 
The net result of all these factors is a great in- 
crease in the data that a system must get from 
memory —sometimes 10 times more than is ac- 
tually needed. 

A practical solution is to develop parallel pro- 
cessing data-base management systems, and 
designers can do this with the Geometric Arith- 
metic Parallel Processor (GAPP), a systolic 
array containing 72 single-bit processors. Each 
processor has 128 bits of RAM. The chip, the 
first commercially available two-dimensional
systolic array, processes data words in parallel, 
working on words of varying lengths by sequen- 
one of the RAM addresses. When the desired
word is found, its replacement can easily be 
substituted while data streams through the 
pai at rates up to a million characters per sec- 
ond. 


Shuffling the cards 


Another major task of any data-base man- 
agement system is sorting. This involves a bit- 
comparison operation that is somewhat similar 
to the compare operation. But in comparing bits 
for sorting, the system must determine not only 
whether the data in the NS register matches 
that in the EW register, but also whether it is 
greater or less than that contained in the EW 
register. 

The system does this three-way comparison by examining both the
Sum (SM) and Borrow (BW) outputs from the ALU (see the table,
below). If condition C = 0 and the data in the NS register
matches the data in the EW register, then SM = 0. If the data
in the NS register does not match, then SM = 1. However, when
the data is not matched, BW = 1 if the data in the NS register
is less than the data in the EW register. Alternatively, BW = 0
if the data in the NS register is greater than the data in the
EW register. The “match-finding” program performs this operation
and stores the SM and BW results in RAM 2 and RAM 3,
respectively.
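
The three-way decision can be sketched in Python. The article specifies the SM and BW outputs only for C = 0; the general full-subtractor expressions used below for NS - EW - C are a standard textbook form and are an assumption about the ALU, as are the function names.

```python
def sm_bw(ns, ew, c=0):
    """One-bit subtract NS - EW - C: return (sum, borrow).

    Assumption: standard full-subtractor logic; the source only describes
    the C = 0 case.
    """
    sm = ns ^ ew ^ c
    bw = ((1 - ns) & ew) | ((1 - ns) & c) | (ew & c)
    return sm, bw

def three_way(ns, ew):
    """Classify NS against EW using the SM and BW outputs with C = 0."""
    sm, bw = sm_bw(ns, ew)
    if sm == 0:
        return "equal"
    return "less" if bw else "greater"

print([three_way(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# ['equal', 'less', 'greater', 'equal']
```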

This greater-than or less-than comparison 
forms the basis of the second building block, the 
sorter. It performs a task common to all data-base
systems: reordering data, based on the
user’s request. With minor alterations, the tra- 
ditional ordering and sorting algorithms used 
with conventional serial processors can be eas- 
ily switched to take advantage of the parallel 
processing of the systolic array. The result is an 
increase in performance. 

The sorting algorithm used in a systolic array system is
basically just a parallel version of the classical exchange
sort, in which pairs of numbers or other data are exchanged to
reflect their relative position in the sorting order. For
instance, pairs of data are first compared in the order in
which they are found in memory (1 and 2, 3 and 4). Then they
are paired again for the next comparison (2 and 3, 4 and 5). At
each comparison, the pairs of items not in order are exchanged
(Fig. 3).
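
The alternating pairing described above is commonly known as odd-even transposition sort. A minimal Python sketch follows; it runs the compare-exchange steps of each pass serially, whereas the array would perform all pairs of a pass at once. The function name is hypothetical.

```python
def odd_even_transposition_sort(items):
    """Sort with alternating compare-exchange passes.

    Even passes pair positions (0,1), (2,3), ...; odd passes pair
    (1,2), (3,4), ... Each pass touches disjoint pairs, so the pairs
    within a pass could be processed in parallel.
    """
    a = list(items)
    n = len(a)
    for step in range(n):                 # n passes suffice for n items
        start = step % 2
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([7, 3, 8, 1, 5, 2]))  # [1, 2, 3, 5, 7, 8]
```

With one comparator per pair, the number of parallel passes grows linearly with the number of items, while the total number of comparisons still grows quadratically.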


Getting a perfect match 


To implement such a sorting algorithm with 
the systolic array chip, two strings comprising 
records A and B must be loaded into the array, 
much as was done in the comparator block. 
Each record is again assumed to consist of 
12 characters of 6 bits each. Once they are load- 
ed in, the SM and BW output bits are computed 
to find matches (see the program, p. 354). 

If the records do not match, the system must 
determine whether record A is less than or 
greater than record B. The systolic array does 
this by searching the bit string until the first 
unmatched character is found, then identifying 
the MSB that does not match, again using bit 
comparison. 

The array does this by shifting a marker, or “1” bit, in
serpentine fashion through all the processor elements as shown
in Figure 2. The marker propagates until it finds the “FIRST”
responder, where RAM2=1. Then RAM3 is examined in that
processor element to determine whether the records should be
swapped.

During this operation the device must make 
an exchange or no-exchange decision before the 
records are loaded into the next row of sorter 
blocks. This may require the marker to propa- 
gate through all 72 processor elements within 
some chips. Meanwhile, other chips in the grid 
are idle until all devices have completed their 
swaps. 

Each sorting cycle for comparing pairs takes 
for storage. The technique also proposed a processor
for each multiple head to furnish the
utmost in throughput.

But the high cost of multiple head disks, cou- 
pled with the expense of individual processors 
for each track, forced compromises. Now, high 
speed, moving-head technology has been devel- 
oped so that there is only one head for each disk 
surface, greatly trimming the requisite number 
of heads and processors. 

As memory prices have dropped, data-base 
architectures have moved toward cache mem- 
ories. Large disk caches increase speed, min- 
imizing the number of disk accesses while deliv- 
ering faster transfer rates than are possible 
with off-the-disk approaches. 

Regardless of the approach taken, the systol- 
ic array will generally be used only to perform 
dedicated data-base processing: its architec- 
ture does not lend itself to the diversity of tasks 
that must be performed by a host computer. 


Block by block 


As with other set-ups discussed in this series, 
the systolic array data-base management 
system can be built building-block style. 
Groups of GAPP devices can be put together, to 
form an SIMD array of the required size. In ad- 
dition, several different blocks can each per- 
form specific tasks so that the complete data 
base machine operates as a multiple-instruc- 
tion, multiple-data-path (MIMD) system. 

One block, for example, can address the basic 
task of any data-base manager: searching the 
memory to locate some required information or 
to determine that it is not in the system. The 
systolic array easily makes such comparisons 
in parallel. Consider a search for a 12-character 
comparand (the comparand is the required 
data that is compared with the data in memo- 
ry). If each character is made up of 6 bits, the 
12-character search can be handled neatly by 
the 72 processor elements of one chip (Fig. 2). 

The code for loading the comparand into the chip is simple:
CM:=CMS (repeated 12 times) followed by the instruction
EW:=RAM0, RAM0:=CM. The EW register, one of four registers in
each processor element, is loaded via the CMS line. Since
several systolic arrays can
be linked to form grids, it is easy to increase the 
size of a grid to match the typical word size and 


processing rates of the system. 

As the data is being fed into the grid, it 
streams through the chips, entering on the 
south and exiting at the north. After each char- 
acter is clocked into the array, an exclusive OR 
comparison is performed and the result placed 
in the NS register. If all characters in the array 
match, then the global output (GO) flag is high. 


I don’t care 


The comparand can be masked so unimportant characters or bits
represent a “don’t-care” condition. Locations being masked with
the “don’t-care” condition will place a 0 in RAM 2, while all
other locations will have a 1 in RAM 2. The result of the
exclusive OR comparison is then ANDed with the mask in RAM 2
before the result is placed in the NS register.
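
The masked comparison can be sketched in Python, assuming one bit per processor element held in parallel lists; the function name is hypothetical. A position contributes to the result only where its mask bit is 1.

```python
def masked_match(record, comparand, mask):
    """True when record equals comparand wherever the mask bit is 1."""
    # XOR flags a mismatch; AND with the mask discards don't-care positions.
    diff = [(r ^ c) & m for r, c, m in zip(record, comparand, mask)]
    return not any(diff)

# Position 1 differs but is masked out in the first call.
print(masked_match([1, 0, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1]))  # True
print(masked_match([1, 0, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]))  # False
```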

This type of exact match is useful for search- 
ing text files for specific information. In a com- 
mon variation of this operation, like “find and 
replace,” the replacement word can be stored in 


2. The chip (a 6 × 12 array of processor elements) can accept and
process input record A, which comes from memory, and the comparand or
input record B. If it cannot complete the processing in one pass, the
signal can be wrapped around and run through again.
matches for both the time and salary constants.
The results are then passed to a third block, 
which performs the AND function that deter- 
mines acceptance or rejection for the two-part 
query. 

Table settings 


A related function that can be performed 
much more quickly with parallel processors is 
a multiple file operation like Join or Semi-Join.
A Semi-Join operation, the most common in 
most data-base processing systems, produces a 
subset of one table; this subset is determined by 
a relationship from a second table. 

This operation requires both sorter and compare
blocks (Fig. 4). The search for “employees
working on the data-base project” is a typical 
Semi-Join operation; it uses two files. The two 
in this example have a matching component, 
the Social Security number, that lets the sys- 
tem list the personnel on the project, even 
though their names are not in the project file. 
The task can be performed by several blocks of 
GAPP devices. The two files are passed through 
two sorter blocks. There the lists are put in or- 
der employing the Social Security numbers. 
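
The logic of this Semi-Join can be sketched in Python. The record layout, the sample identifiers, and the function name are hypothetical and chosen only for illustration; the sketch mirrors the sort, compare, and select stages rather than the hardware blocks themselves.

```python
def semi_join(employees, projects, project_name):
    """Names of employees whose key appears in `projects` under project_name.

    employees: list of (key, name); projects: list of (key, project).
    """
    # Comparator stage: keep keys of records on the requested project, sorted.
    wanted = sorted(key for key, proj in projects if proj == project_name)
    # Sorter plus selector stage: pass names whose key is in the wanted list.
    return [name for key, name in sorted(employees) if key in wanted]

employees = [("333", "Michelle"), ("111", "John"), ("222", "Peter")]
projects = [("222", "data-base"), ("111", "compiler"), ("333", "data-base")]
print(semi_join(employees, projects, "data-base"))  # ['Peter', 'Michelle']
```

The output contains fields from the employee file only, even though the selection depends on the project file, which is what distinguishes a Semi-Join from a full Join.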

A third block, a comparator, compares the 
data in the project record in memory to the con- 
stant—in this case, the “data-base” manage- 
ment project—selecting only the records that 


A match-finding program

CM:=0, NS:=0, EW:=0, C:=0, RAM0:=C          /* Initialize */
CM:=CMS, NS:=S, EW:=0, C:=0, RAM0:=C        /* Repeat this instruction until all
                                               characters of the two records to be
                                               compared are loaded into the array */
CM:=CM, NS:=NS, EW:=0, C:=0, RAM0:=SM       /* Write record A to RAM 0 */
CM:=CM, NS:=NS, EW:=RAM1, C:=0, RAM1:=CM    /* Write record B to RAM 1 and EW */
CM:=CM, NS:=RAM2, EW:=0, C:=BW, RAM2:=SM    /* Write SM = A ⊕ B to RAM 2 and NS */
CM:=CM, NS:=NS, EW:=0, C:=0, RAM3:=C        /* Write BW = A < B to RAM 3 */
                                            /* Test GO: if GO = 1 then
                                               record A matches record B */


4. A configuration of parallel processor building blocks can tackle data-base
tasks. Here four chips search through two files to find the names of workers
involved in a data-base management project.
less than 24 µs. Thus eight records can be sorted through the
eight stages in 192 µs, and eight more records can be entered
into the pipeline every 24 µs.

A sort using the systolic array chips will take N steps to
process N elements, compared with N² steps for a typical von
Neumann processor.
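
The pipeline arithmetic can be checked with a short Python sketch. It treats 24 µs as an upper bound per stage, as the text does, and the function name and default parameters are assumptions for illustration.

```python
def pipeline_time_us(batches, stages=8, stage_us=24):
    """Upper-bound time for `batches` groups of records to clear the sorter.

    The first batch needs `stages` cycles; each later batch adds one more.
    """
    return (stages + batches - 1) * stage_us

print(pipeline_time_us(1))  # 192
print(pipeline_time_us(3))  # 240
```

Once the pipeline is full, each additional batch of eight records costs only one further stage time.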


Forging links 


Systolic array building blocks can easily be 
linked into systems that perform the data-base 
functions. The comparator block alone can per- 
form queries involving only one file, such as 
“Find all employees hired before 1980.”

The data-base records are read in parallel 
and loaded into the grid of systolic array chips. 
Both the query field, “date of employment,” and 
the constant, “1980” are loaded into the com- 
parator. As the compare function is performed, 
the records that are less than “1980” are sent to 
the host, while the others are ignored. 

The use of two comparator blocks allows de- 
signers to perform more complex tasks, like 
finding “employees hired after 1980 whose 
salary is greater than $25,000.” Two blocks of 
systolic arrays can search in parallel to find 


3. The array chip compares the values “a, b, c...” along the bottom of
each row of processor elements and then rearranges them in the desired
rank until the proper values reach the top. (Panels: after first sort;
after final sort. Some devices serve only as storage/delay elements.)
match. The results are then passed to another
comparator block, with only the Social Security 
number going to this group of systolic arrays. 
This sorter block then looks for matches with 
the file that has both names and Social Security 
numbers. When a Social Security number from 
the employee file is not in the list on the com- 
parator block, that record is removed from the 
buffer by the selector. When the two records 
match, the employee’s name is passed through 
the selector to the host computer. This type of 
parallel processing architecture provides 
several orders of magnitude higher throughput 
than a von Neumann machine. 

By using building blocks in this fashion, a de- 
signer can easily create a full system divided in- 
to five units: the host, the systolic array con- 
troller, the systolic array blocks, a switching 
matrix, and a storage device (Fig. 5). 

The host performs typical tasks, including 
processing and compiling queries, issuing com- 
mands to the systolic array, and receiving 





responses, as well as handling user communica- 
tions and interfacing. The GAPP controller 
need be no more than a dedicated microcomput- 
er. It receives commands from the host and an- 
alyzes them, then dispatches programs and I/O 
commands for the array. It also receives the 
output from the arrays when the tasks are com- 
pleted. The controller then selects data from 
this response and sends the appropriate data to 
the host. Each block of chips in the system has 
an address, so the controller can distribute pro- 
grams over the appropriate control lines. 

The switching matrix serves as the link be- 
tween the storage device and the systolic array 
building blocks. Storage can be handled by
disks or cache memory.


5. Systolic array chips can be called into duty in a data-base
management system, forming a processor block, a controller, and a
switching matrix. The matrix prepares data for array processing.
WASHINGTON

1750 132nd Ave.. N.E. 
Bellevue. WA 98005 
(206) 453-8300


MANHATTAN SKYLINE 


UNITED KINGDOM 
Manhattan House 

Bridge Road 

Maidenhead 

Berkshire SL6 8DB 
England 

Maidenhead (0628) 75851 


NCR Microelectronics 


EASTERN AREA SALES OFFICE 


NCR Microelectronics Division 
400 W. Cummings Park 

Suite 2750 

Woburn, MA 01801 

Phone: (617) 933-0778 


CENTRAL AREA SALES OFFICE 


NCR Microelectronics Division 
400 Chisholm Place

Suite 100 

Plano, TX 75075

Phone: (214) 578-9113 


WESTERN AREA SALES OFFICE 


NCR Microelectronics Division 
4655 Old Ironsides Drive
Suite 400 

Santa Clara, CA 95050 
Phone: (408) 727-6575 
NCR Microelectronics Division 2001 Danfield Ct. Fort Collins, Colorado 80525
Telex: 045-4505 NCRMICRO FTCN Phone: 303/226-9500 303/223-5100 
NCR’s Commitment to Quality


As a pioneer in microelectronic tech- 
nology, NCR has been manufacturing 
components for its own product line 
since 1971. This experience has pro- 
vided opportunities to learn about 
user application problems, the impor- 
tance of component quality and re- 
liability, and their effects on total 
system reliability. The net result of 
such experience is a dedication to 
manufacturing superior components 
based on a firm commitment to quality 
and reliability. 


NCR Quality Assurance completes a
rigorous evaluation of each product to
ensure conformance of the product to
its specification. Once a component is
approved for production, stringent
process and assembly controls along
with detailed inspections are used to
build in reliability. Comprehensive
electrical testing is performed to
guarantee the performance of each
component; finished products are
inspected before shipment to assure the
conformance to specification of each
lot of devices, and sampling plans are
constantly revised and updated to
improve quality.

Essential to any reliability program is
feedback from the system user,
communication that is vital for reliability
growth. NCR strives to ‘close the
loop’ by communicating with users to
evaluate problems and respond with
corrective action. The closed-loop
concept results in better understanding
of user needs while improving
reliability.

The NCR commitment to quality and
reliability is an integral part of corporate
philosophy originating from
and emphasized by the highest levels
of NCR management. This management
direction, combined with NCR's
manufacturing and user application
experience, provides a solid framework
for continued improvement in
quality and reliability.
NCR Microelectronics


NCR, a multi-billion-dollar manufac- 
turer of computer systems, terminal 
products, and semiconductors, es- 
tablished its first microelectronics labo- 
ratory in 1963 to stay abreast of the 
emerging semiconductor technology. 
The laboratory was expanded in 1966 
to provide limited quantities of proto- 
type microcircuits designed for use in 
a number of new products. By 1968 
the first MOS circuits were produced, 
and by 1970 a complete family of cir- 
cuits had been designed, produced in 
prototype quantities, and incorporated 
into new NCR products. Based upon 
knowledge gained in this research 
and confidence in the ultimate advan- 
tages of MOS, the decision was made 
to expand the internal production ca- 
pability. In 1971, the Miamisburg, 
Ohio plant was completed. 


To meet internal demand, NCR expanded
its microelectronics operation
in 1975 with the addition of a second 
production facility in Colorado 
Springs, Colorado, and in 1979 
added a third facility in Ft. Collins, Col- 
orado. The Colorado Springs facility 
was replaced in 1982 by a new plant 
occupying 100,000 square feet. This 
new plant is one of the most modern, 
best-equipped facilities of its kind any- 
where. 


NCR Microelectronics manufactures 
state-of-the-art NMOS, CMOS, and 
non-volatile SNOS components which 
provide a competitive advantage to its 
computer systems and terminal prod- 
uct lines. 


In mid-1981 NCR announced its entry
into the merchant semiconductor market.
The strength and discipline
gained in 10 years of internal supply
are now being made available to our
customers. This experience, together
with a family of innovative products 
and services, establishes NCR as a 
leading supplier of semiconductor 
devices and services. 


Copyright © 1985 by NCR Corporation, 
Dayton, Ohio, U.S.A. 
All Rights Reserved. Printed in U.S.A.


Colorado Springs, Colorado





Fort Collins, Colorado 





Miamisburg, Ohio 
NMOS Read Only Memory Family (Continued)

Device      Part Number        Characteristics
256K ROM    NCR 23256-15†      Static/27256
            NCR 23256-20       Static/27256
            NCR 23256-25       Static/27256
            NCR 23256-30       Static/27256
            NCR 23256-45       Static/27256
            NCR 23256S-15†     Static/Standby
            NCR 23256S-20      Static/Standby
            NCR 23256S-25      Static/Standby
            NCR 23256S-30      Static/Standby
            NCR 23256S-45      Static/Standby
            NCR 23257-15†      Static/Alt. Pin Out
            NCR 23257-20       Static/Alt. Pin Out
            NCR 23257-25       Static/Alt. Pin Out
            NCR 23257-30       Static/Alt. Pin Out
            NCR 23257-45       Static/Alt. Pin Out
            NCR 23257S-15†     Static/Standby
            NCR 23257S-20      Static/Standby
            NCR 23257S-25      Static/Standby
            NCR 23257S-30      Static/Standby
            NCR 23257S-45      Static/Standby

Commercial Operating Temperature of 0°C to 70°C is standard for all NCR NMOS ROMs.
Industrial Operating Temperature of -40°C to 85°C is also available.


CMOS Read Only Memory Family

Device      Part Number        Characteristics
64K ROM     NCR 23C64-15       Static/2564
            NCR 23C64-20       Static/2564
            NCR 23C64-25       Static/2564
            NCR 23C65-15       Static/2764
            NCR 23C65-20       Static/2764
            NCR 23C65-25       Static/2764
128K ROM    NCR 23C128-15      Static/27128
            NCR 23C128-20      Static/27128
            NCR 23C128-25      Static/27128
256K ROM    NCR 23C256-15      Static/27256
            NCR 23C256-20      Static/27256
            NCR 23C256-25      Static/27256
512K ROM    NCR 23C512-15†     Static/27512
            NCR 23C512-20      Static/27512
            NCR 23C512-25      Static/27512
1024K ROM   NCR 23C1000-25†    Static

†Product available 3Q85
Commercial operating temperature of 0°C to 70°C is standard for all NCR CMOS ROMs.
Read Only Memories

NCR offers a full line of high performance Read Only Memories (ROM) with
a variety of pinouts and access times. All NCR ROMs are 5 volt only in
both commercial and industrial operating temperature ranges. The NCR
NMOS and CMOS processes and experience in the ROM market allow NCR to
provide fast turnaround of prototype and production quantities plus
provide the customer service and support required of a major supplier
of ROMs in today's market. Look to NCR for your ROM requirements to
ensure that your products reach the market place in time for maximum
market penetration.


NMOS Read Only Memory Family

                                          Access Time  Supply Current Max (mA)
Device    Part Number     Organization    Max (ns)     Operating  Standby       Characteristics
16K ROM   NCR 2316-20                     200          75         24            Static/2716
          NCR 2316-25                     250          75         24            Static/2716
          NCR 2316-30                     300          75         24            Static/2716
          NCR 2316-45                     450          75         24            Static/2716
32K ROM   NCR 2332-20                                  75                       Static/2532
          NCR 2332-25                                  75                       Static/2532
          NCR 2332-30                                  75                       Static/2532
          NCR 2332-45                                  75                       Static/2532
          NCR 2333-20                                                           Static/2732
          NCR 2333-25                                                           Static/2732
          NCR 2333-30                                                           Static/2732
          NCR 2333-45                                                           Static/2732
64K ROM   NCR 2364-20     8Kx8                                                  Static/2564
          NCR 2364-25     8Kx8                                                  Static/2564
          NCR 2364-30     8Kx8                                                  Static/2564
          NCR 2364-45     8Kx8                                                  Static/2564
          NCR 2364S-20    8Kx8                                                  Static/Standby
          NCR 2364S-25    8Kx8                                                  Static/Standby
          NCR 2364S-30    8Kx8                                                  Static/Standby
          NCR 2364S-45    8Kx8                                                  Static/Standby
          NCR 2364A-45*   Two 4Kx8 Banks                                        Static/Bank Select
          NCR 2365-20     8Kx8                                                  Static/2764
          NCR 2365-25     8Kx8                                                  Static/2764
          NCR 2365-30     8Kx8                                                  Static/2764
          NCR 2365-45     8Kx8                                                  Static/2764
          NCR 2365S-20    8Kx8                                                  Static/Standby
          NCR 2365S-25    8Kx8                                                  Static/Standby
          NCR 2365S-30    8Kx8                                                  Static/Standby
          NCR 2365S-45    8Kx8                                                  Static/Standby
128K ROM  NCR 23128-15†   16Kx8                                                 Static/27128
          NCR 23128-20    16Kx8                                                 Static/27128
          NCR 23128-25    16Kx8                                                 Static/27128
          NCR 23128-30    16Kx8                                                 Static/27128
          NCR 23128-45    16Kx8                                                 Static/27128
          NCR 23128S-15†  16Kx8                                                 Static/Standby
          NCR 23128S-20   16Kx8                                                 Static/Standby
          NCR 23128S-25   16Kx8                                                 Static/Standby
          NCR 23128S-30   16Kx8                                                 Static/Standby
          NCR 23128S-45   16Kx8                                                 Static/Standby
          NCR 23128A-30*  Four 4Kx8 Banks                                       Static/Bank Select
          NCR 23128A-45*  Four 4Kx8 Banks                                       Static/Bank Select

*Licensed under U.S. Patent Number 4368515
†Product available 3Q85
Electrically Erasable PROM

The NCR family of EEPROMs includes small organization serial devices for
applications requiring a limited amount of storage capability. The NCR
family also includes high density by eight devices for applications
requiring maximum data storage. All members of the NCR EEPROM family are
5 volt only devices with all erase/write voltages being generated on
chip. This combination of high density and 5 volt only operation places
NCR in the leadership position in EEPROMs. NCR EEPROMs are offered in
commercial, industrial, and military temperature ranges.


Electrically Erasable PROM

                                 Operating              No. of
Device           Part Number     Range (°C)             Pins
256 Bit EEPROM   NCR 52801       0 to +70                             Serial
                 NCR 52801 I     -40 to +85                           Serial
                 NCR 59306       0 to +70                             Serial
                 NCR 59306 I     -40 to +85                           Serial
                 NCR 52832       0 to +70               28/32*        Parallel
                 NCR 52832 I     -40 to +85             28/32*        Parallel
                 NCR 52832 HR    -55 to +125            28/32*        Parallel

*28 Pin DIP or 32 Pin LCC





Non-Volatile RAM 


Non-volatile RAM (NVRAM) circuits combine high performance static RAM
with electrically erasable PROM on a single integrated circuit. The
primary advantage NVRAMs offer the system designer is their ease of
interfacing with a microprocessor without affecting system performance.
This is possible because an NVRAM looks and performs like a static RAM
during normal operation. During a system power failure the entire
contents of the static RAM can be stored in the EEPROM array and are
available for recall when system power returns to normal levels. NCR
NVRAMs are offered in commercial, industrial, and military temperature
ranges.
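
The store and recall behavior described above can be pictured with a small software model. The following Python sketch is illustrative only: the class name `NVRAMModel`, its method names, and the default 64x4 size (borrowed from the NCR 52210 organization) are assumptions made for this example and do not describe an NCR pinout or command set.

```python
class NVRAMModel:
    """Toy model of an NVRAM: a static RAM shadowed by an EEPROM array."""

    def __init__(self, words=64, bits_per_word=4):
        self.mask = (1 << bits_per_word) - 1
        self.sram = [0] * words      # volatile working copy
        self.eeprom = [0] * words    # non-volatile shadow copy

    def write(self, addr, value):
        # Normal operation: behaves like a static RAM write.
        self.sram[addr] = value & self.mask

    def read(self, addr):
        # Normal operation: behaves like a static RAM read.
        return self.sram[addr]

    def store(self):
        # Power is failing: copy the entire SRAM into the EEPROM array.
        self.eeprom = list(self.sram)

    def power_loss(self):
        # SRAM contents are lost; the EEPROM array keeps its data.
        self.sram = [0] * len(self.sram)

    def recall(self):
        # Power has returned: copy the EEPROM array back into the SRAM.
        self.sram = list(self.eeprom)


nv = NVRAMModel()
nv.write(3, 0b1010)
nv.store()
nv.power_loss()
nv.recall()
print(nv.read(3))
```

The microprocessor only ever touches the SRAM side in this model, which is why the text can say the part looks like an ordinary static RAM during normal operation.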
Non-Volatile RAM

                                                Access Time   Power Supply   No. of   Operating
Device           Part Number     Organization   Max (ns)      (Volts)        Pins     Range (°C)
256 Bit NVRAM    NCR 52210       64x4           300           +5             18       0 to +70
                 NCR 52210 I     64x4           300           +5             18       -40 to +85
                 NCR 52210 HR    64x4           450           +5             18       -55 to +125
512 Bit NVRAM    NCR 52211       128x4          300           +5             18       0 to +70
                 NCR 52211 I     128x4          300           +5             18       -40 to +85
                 NCR 52211 HR    128x4          450           +5             18       -55 to +125
1K NVRAM         NCR 52212                                                   18       0 to +70
                 NCR 52212 I                                                          -40 to +85
                 NCR 52212 HR                                                         -55 to +125
1K NVRAM         NCR 52001                                                            0 to +70
                 NCR 52001 I                                                          -40 to +85
                 NCR 52001 HR                                                         -55 to +125
2K NVRAM         NCR 52002                                                            0 to +70
                 NCR 52002 I                                                          -40 to +85
                 NCR 52002 HR                                                         -55 to +125
4K NVRAM         NCR 52004                                                            0 to +70
                 NCR 52004 I                                                          -40 to +85
                 NCR 52004 HR                                                         -55 to +125
Design and Applications
Assistance


You have the option of using one of
the Semicustom Design Centers for
design services and support, or you
may prefer to purchase a workstation
and design your device in your
facility. In either case, NCR will provide
full engineering support.


Options include:

• completing design verification in
  your facility
• designing the device at the NCR
  facility or an NCR Design Center
• permitting NCR or an NCR Design
  Center to perform design
  verification and provide a device
  which meets your logic specifications


CAD Tools 


Semicustom design and development 
are done with the most sophisticated 
tools available. NCR is committed to re- 
taining the position of industry leader in 
technology, applications support and 
service. To meet this commitment, NCR 
has acquired and/or developed the 
best CAD tools that the industry can 
offer. 


Engineering Workstation Support 


NCR is a leader in the support of the
most popular and powerful engineering
workstations. Presently, you have your
choice of the Daisy™ or Mentor
Graphics™ Workstations, and in 1985
the Valid™ Workstation. All of these
workstations have powerful, user-friendly
software which is well suited to
the design of semicustom integrated
circuits as well as for other applications.
NCR has ported its proprietary
software, such as VITA™, to these
workstations and interfaced to the resident
software. This means that you can
perform total design capture and
verification on the workstation without the need
for any resimulations by NCR. In addition,
NCR has developed documentation
specific to each workstation to
guide you in the use of the NCR
Semicustom Design and Verification
System™ with the commands and
procedures specific to that workstation. NCR
Design Centers and applications engineers
are available full-time to assist you
in every phase of design and development,
including hands-on training on a
workstation. NCR is actively involved
with engineering workstation industry
leaders to continue the evolution of
design capabilities and tools.


Timing Analysis


For timing analysis, NCR has developed
the VITA™ (VLSI Timing and Interconnect
Analysis) package of programs.
NODE DELAY and PATH
DELAY feature user prompts and keep 
track of signal names for ease of use. 
PLUG DELAY provides feedback to 
logic simulators for "realtime" simula- 
tions. These programs can be run both 
before layout, using estimated intercon- 
nect capacitances, and after layout, us- 
ing extracted interconnect RC values, 
and rise/fall effects on cell delays. 
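
The idea of running the same path calculation first with estimated and then with extracted loading can be sketched with a simple lumped model. This is not the VITA™ algorithm, which the brochure does not describe; the `Stage` record, the linear delay formula (intrinsic delay plus drive resistance times load capacitance), and every number below are assumptions chosen only to illustrate the before-layout and after-layout comparison.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    intrinsic_ns: float   # unloaded cell delay (hypothetical value)
    drive_kohm: float     # output drive resistance of the cell (hypothetical)
    load_pf: float        # fan-in plus interconnect capacitance


def node_delay(stage):
    # Lumped approximation: kilohms times picofarads gives nanoseconds.
    return stage.intrinsic_ns + stage.drive_kohm * stage.load_pf


def path_delay(stages):
    # A path delay is the sum of the node delays along the path.
    return sum(node_delay(s) for s in stages)


# Before layout: interconnect capacitance is only an estimate.
estimated = [Stage("U1", 1.0, 2.0, 0.5), Stage("U2", 1.5, 1.0, 1.0)]
# After layout: the same path with capacitance extracted from the layout.
extracted = [Stage("U1", 1.0, 2.0, 0.75), Stage("U2", 1.5, 1.0, 1.25)]

print(path_delay(estimated), path_delay(extracted))
```

In this made-up case the extracted loading lengthens the path, which is the kind of difference a post-layout run is meant to expose before prototypes are built.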


For analog simulations, NCR will pro- 
vide SPICE models for the cells, and full 
characterization data sheets. 


Layout 


Layout, using NCR enhancements to
industry standard auto-place-and-route
(APR) programs, has become a streamlined
activity producing excellent results.
Customers have the option of having
NCR perform the layout from a provided
netlist and specifications, or of
obtaining an industry standard APR for
in-house use. NCR is also cooperating
with industry efforts to develop APR
capability on engineering workstations.


TEGAS™ is a registered trademark of General Electric-CALMA Co.
VITA™, SENTPEX™, and Semicustom Design and Verification System™
are registered trademarks of NCR Corporation.
CAL-MP™ is a registered trademark of SILVAR-LISCO.
Daisy™ is a registered trademark of Daisy Systems, Inc.
Mentor Graphics™ is a registered trademark of Mentor Graphics Corporation.
Valid™ is a registered trademark of Valid Logic Systems, Inc.


Test Program Generation 


NCR developed the SENTPEX™ (Sentry
Test Pattern Extractor) package of programs.
This software checks simulation
patterns from workstations or TEGAS™ V for
compatibility with industry standard IC
testers, converts them to tester format
and compresses the patterns. The results
are combined with DC parameters
and compiled to generate the test programs
used in prototype testing and
production testing of the device.


CAD Software Tools 


• Schematic entry and check
• Netlist extraction
• Logic simulation: TEGAS™ V and/or
  workstation-based simulation, to verify
  functionality and provide vectors
  for testing the device
• Timing Analysis: The VITA™ (VLSI
  Interconnect and Timing Analysis)
  package uses both estimated interconnect
  loading and extracted interconnect
  RC loading and rise/fall effects
  to accurately model signal
  delays. It calculates path delays as
  well as providing timing information
  to include in logic simulation.
• Automatic Place and Route: CPR3
  and CAL-MP™ optimize placement of
  cells and automatically route the entire
  circuit, taking into account any
  specified critical paths.
• Layout Verification: includes comparison
  of the netlist extracted from
  the layout to the original netlist to verify
  accuracy and eliminate all possible
  layout errors, ERCs (Electrical
  Rule Checks) and DRCs (Design
  Rule Checks).
• Fault Grading: verifies test pattern
  quality; performed primarily with
  TEGAS™.
• Test Pattern Generation: the
  SENTPEX™ package checks simulation
  pattern compatibility with testers,
  converts and compresses the patterns
  and compiles the test program.


NCR Semicustom Design 


NCR Semicustom Design offers you 
the same high performance, design 
flexibility and breadth of functions as 
a fully-customized integrated circuit, 
while simultaneously minimizing 
development time and cost. Key elements
of the NCR system include
computer-aided design (CAD) tools,
advanced process technologies, total
technical support and a wide selection
of cell functions in a
state-of-the-art CMOS standard cell 
library. 

You can take the lead in design and 
development with NCR technical ex- 
pertise and foundry facilities to aid 
you in finding and implementing the 
optimal solution to your needs. 
Every phase of the design and 
development process is followed up 
with the NCR state-of-the-art support 
system, permitting more freedom 
and security to explore alternatives 
at minimal cost. 


• Performance: propagation delays
  less than LSTTL and HCMOS
  technologies
• Advanced process technology:
  low power CMOS
• Directly TTL and HCMOS
  Compatible: no interface or pull-ups
  required
• Sophisticated CAD System:
  minimizes risk while easing and
  speeding design, providing a first
  pass working part
• Optional ROM, Static RAM, and
  PLA: customer definable in size
  and organization, with the option
  of analog and a core microprocessor
  on the same chip
• Silicon Efficient: no fixed-routing
  channels or cell locations. NCR
  Semicustom Design allows close
  packing of high-level functions for
  minimum die size and lowest
  overall cost of any semicustom
  solution
• Many 7400/5400 equivalent functions
• Versatile in-house assembly capability
  for plastic and ceramic dual-in-line
  and chip carrier package
  types
Standard Cell Microcomputer with Core Microprocessor, Sound Generator,
I/O and Random Cell Logic.


Cost 


Compared to discrete logic, the use
of an NCR cell library device to integrate
system logic greatly reduces
system power requirements, board
space, component cost, manufacturing
cost, weight, and overhead
costs such as rework, inventory and
purchasing. Reliability and performance
will also be improved. All these
factors directly impact unit pricing,
particularly in volume production,
making a cell library device a mandatory
design choice.


[Figure: NCR Semicustom Design and Verification System. Flow chart
blocks: schematic entry; netlist extract; design verification (simulation
and timing analysis); pattern checks, conversion, compression, and fault
grade option; layout checks, rules checking, and compare against extracted
information; compile test program; tooling and prototype fabrication.
Schedule: design phase, 2-10 weeks; tooling and prototyping, 4-8 weeks;
prototype approval; production ramp-up, 8-12 weeks; volume production.]
Digital Signal
Processing 


NCR now offers two DSP VLSI devices:
the NCR45CG72 is the Geometric
Arithmetic Parallel Processor chip
(GAPP) and the NCR45CM16 is the
Multiplier Accumulator chip. Both of
these devices are targeted for emerging
digital signal processing applications.
The 45CM16 is aimed at
microprocessor-based systems that
perform multiply intensive tasks. Examples
include process control, robotics,
and electronic instruments. The GAPP
is well suited for applications in which
operations are repetitively applied over
large arrays of data. This includes many
image processing applications such as
pattern recognition, automatic inspection,
convolution, correlation, data compression,
and machine vision.


Geometric Arithmetic 
Parallel Processor 


FEATURES 


• 6 x 12 systolic array of processors in
  CMOS VLSI
• Highly parallel architecture
• Nearest neighbor communication
  between processors
• GAPP devices fully cascadeable
• Overlapped I/O and computation
• On-chip 128-bit SRAM per processor


The GAPP is a revolutionary architecture
composed of 72 individual
processor elements arranged in a 6 x
12, two-dimensional array. All of the
processors operate in parallel, and
each processor is able to manipulate
different data. The massive parallelism
inherent in the chip's architecture
provides the processing power of 72
processors on a single piece of silicon.


Geometric Arithmetic Parallel Processor (GAPP) 


Within each processor are a bit-serial
ALU, 128 bits of RAM, and four single-bit
latches. Three of these latches hold
inputs to the ALU and the fourth latch
allows I/O operations to be performed
without interrupting the program execution.
Thus, I/O operations can be overlapped
with computation. Each of these
processors is able to communicate and
exchange data with its four immediate
neighbors: one each to the East, West, North,
and South.
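
To make the bit-serial, single-instruction style concrete, the following Python sketch models a 6 x 12 array in which every processor element adds two operands held in its own 128-bit RAM, one bit per step, and shows a nearest-neighbor shift. It is a teaching model under stated assumptions: the function names, the RAM addresses, and the use of one latch as a carry are inventions for this example and are not the GAPP instruction set.

```python
ROWS, COLS, RAM_BITS = 6, 12, 128


def make_array():
    # One 128-bit RAM (a list of 0/1 values) per processor element.
    return [[[0] * RAM_BITS for _ in range(COLS)] for _ in range(ROWS)]


def load_word(ram, base, value, width):
    # Store an integer least-significant bit first, starting at bit `base`.
    for i in range(width):
        ram[base + i] = (value >> i) & 1


def read_word(ram, base, width):
    return sum(ram[base + i] << i for i in range(width))


def simd_add(array, a_base, b_base, out_base, width):
    """Every element adds its own A and B operands, one bit per step."""
    carry = [[0] * COLS for _ in range(ROWS)]    # one carry latch per element
    for i in range(width):                       # the same step goes to all elements
        for r in range(ROWS):
            for c in range(COLS):
                ram = array[r][c]
                a, b, cin = ram[a_base + i], ram[b_base + i], carry[r][c]
                ram[out_base + i] = a ^ b ^ cin              # sum bit
                carry[r][c] = (a & b) | (cin & (a ^ b))      # carry out


def shift_east(plane):
    """Each element takes its West neighbor's value; the West edge receives 0."""
    return [[0] + row[:-1] for row in plane]


array = make_array()
for r in range(ROWS):
    for c in range(COLS):
        load_word(array[r][c], 0, r * COLS + c, 8)   # operand A in bits 0-7
        load_word(array[r][c], 8, 100, 8)            # operand B in bits 8-15
simd_add(array, a_base=0, b_base=8, out_base=16, width=8)
print(read_word(array[5][11], 16, 8))
```

An 8-bit add takes eight steps regardless of how many elements there are, which is the sense in which throughput comes from the number of processors rather than from the speed of any one of them.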


GAPP chips are cascadeable and allow
system designers to implement processor
arrays of arbitrary size in multiples
of 6 x 12 elements. For instance, two
GAPP chips can be configured to form
a 12 x 12 processor array, eight chips
can be used to form a 24 x 24 array of
processors, and so on. The advantage
of cascading arrays of GAPP chips in
systems is that system throughput increases
linearly with the number of
chips used in the system. Thus, a system
of two GAPP chips offers twice the
processing throughput of a single
GAPP chip, while a system of eight
chips offers eight times the processing
throughput of a single GAPP chip and
four times the processing throughput of
a two-chip GAPP system. This ability to
trade off performance versus chip count
offers the system designer virtually unlimited
freedom in designing systems
around the GAPP to meet specific performance
needs. In addition, software
compatibility can be maintained as system
designers expand their systems by
adding more GAPP chips to increase
system performance.
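
The chip counts quoted above follow from tiling the target array with 6 x 12 blocks. The short Python sketch below reproduces that arithmetic; the function names are hypothetical, and the assumption that all chips share one orientation (either 6 x 12 or 12 x 6) is made for this example only.

```python
import math

CHIP_ROWS, CHIP_COLS = 6, 12


def chips_needed(rows, cols):
    """Fewest 6 x 12 chips covering rows x cols elements, all in one orientation."""
    upright = math.ceil(rows / CHIP_ROWS) * math.ceil(cols / CHIP_COLS)
    rotated = math.ceil(rows / CHIP_COLS) * math.ceil(cols / CHIP_ROWS)
    return min(upright, rotated)


def relative_throughput(chips):
    # Linear scaling with chip count, as stated in the text.
    return chips


for size in (12, 24):
    n = chips_needed(size, size)
    print(size, "x", size, "array:", n, "chips,", relative_throughput(n), "x one chip")
```

For a 12 x 12 array this gives two chips and for 24 x 24 it gives eight, matching the figures in the paragraph above.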
The GAPP architecture is typically
described by such terms as “systolic
array” or SIMD (Single Instruction, Multiple
Data). Regardless of how one
describes it, the GAPP is an undeniable
departure from the traditional
von Neumann architecture, which processes
one data element at a time.
The von Neumann architecture,
for example, depends upon component
technology to attain processing
throughput. The GAPP, on the other
hand, exploits parallelism rather than
relying on component speed to achieve
its throughput. Hence, the GAPP is able
to achieve throughput rates unattainable
by von Neumann architectures.


NCR Semicustom Process 
Technology 


The NCR fine-geometry CMOS process
provides excellent performance. Options
include precision capacitors for
analog and double level metal. NCR's
CMOS is immune to most latch-up situations
with protection of 90 mA at 12V.
Worst case ESD (electrostatic discharge)
is rated at 3.0kV. NCR's CMOS
technology has proven to be a very reliable
high volume process which provides
circuit densities and performances
which are extremely competitive
in today’s market.


Manufacturing 


Whether your semicustom design is
performed by NCR, a design house, or
yourself, NCR will complete your device
development, produce the masks and
fabricate the wafers in-house.


Assembly 


NCR's fast-turn assembly facility permits 
short development cycles and rapid 
ramp-up for initial production. In-house 
packaging includes plastic and ceramic 
DIPs and chip carriers. Off-shore packaging
capabilities offer high volume
economies on all packaging alternatives.


Second Source 


NCR maintains an extensive second 
source agreement with Standard Micro- 
systems Corporation which enables 
customers to activate second-sourcing 
at any point during the design, develop- 
ment or manufacturing process. 


NCR CMOS II Digital Cell Library 


The variety of cells offered allows for op- 
timization of silicon area. A smaller die 
size means better performance and 
lower costs. 


SSI Functions: 


• Buffers and Inverters
  - drive and tristate options
• NAND and NOR
  - available with 2, 3, 4 inputs
• AND and OR: up to 8 inputs
• AOI, OAI, EXOR
• “Combinational” logic cells
  - for denser and faster devices
• Delay Cells
• Two-phase Clock Driver


Flip-Flops/Latches: 


• Cross coupled latches
    both NOR and NAND
• Level sensitive transparent latches
    with Reset
    without Reset
    with clock driver
• Edge triggered D flip-flops
    with Reset
    with Set and Reset
    without Set and Reset
    with clock driver, Set and Reset
• Edge triggered JK flip-flops
    with Set and Reset
    with Set, Reset and clock driver


MSI Functions: 


• Single-bit cascadeable, loadable shift
  register with serial or parallel in, and
  serial out, with or without clock driver
• Single-bit cascadeable, loadable, up-down
  counter with Reset and Enable,
  carry in and carry out


Input/Output Pads and Buffers 


Options give optimal size in pad-limited
designs. Levels are directly TTL and
CMOS compatible.


• Input Cells: choice of standard TTL
  or variety of Schmitt trigger levels
• Output Cells: variety of drive options,
  open drain, pullup options
• Tristate: combination of I/O options


CMOS II Analog Cell Library 


Op Amps 

Comparators 

Analog Switch 

Bandgap Voltage References 
Oscillators 

D/A Converters 

A/D Converters 

Flash A/D Converter 

Sound Generator 

Negative Supply Generators 
Bias Generators 

Logic Level Shifter 
Power-On-Reset 


CMOS II Supercell Library


• Modular ROM
• Modular RAM
• Modular PLA
• Counter/Timer
• 65CX02 Core-microprocessor


Gate Array Technology 


Gate arrays are a viable option if you
have a low volume design or one requiring
fewer functions and therefore
fewer gates. Design and development
cycles are customarily shorter and less
costly for gate arrays. The trade-off is in
design flexibility and production costs,
since a cell library device is smaller and
less costly in larger production quantities.


NCR design engineers will assist you in 
making the most cost-effective decision 
to meet your needs, whether it is a cell 
library device or a gate array. 


NCR Quality Assurance 


The NCR Microelectronics goal in all design
projects is to meet or exceed the
customer's quality and reliability requirements
by building quality in. Each of
NCR's processes and products has 
been extensively characterized and 
qualified. Design Assurance Engineers 
have worked closely with Standard Cell 
Designers and Computer-Aided Design 
Software Engineers to help assure first 
pass design success for all customers 
using Standard Cells. Each cell has 
been fully characterized and subjected 
to the same rigorous reliability testing 
used to qualify the process itself. In ad- 
dition to the initial qualification, the Qual- 
ity Assurance Department samples 
parts from each product and performs 
on-going reliability testing to maintain a 
high level of confidence in fabrication 
and assembly operations. Each part re- 
ceives full functional testing and visual 
inspection prior to shipment. 

As a result of exceedingly high standards
and the desire to be a leader, NCR
Microelectronics has one of the lowest
part reject records in the industry.
NCR/32
Processor Family 


Features 

• 32-bit system architecture
• 13.3 Megahertz frequency
• Effective emulation of mid-range
  mainframes
• Externally microprogrammable
• Real and virtual memory operation
• Large direct memory addressing
• Interface provided to slower peripherals
• On-chip error check and correction


Functional Description 


The NCR/32 VLSI Processor family 
combines the latest advances in semi- 
conductor technology with experi- 
ence gained in three generations of 
computer mainframe design to pro- 
vide a comprehensive microprogram- 
mable 32-bit system architecture. With 
external microprogram capability, an 
extremely flexible microinstruction set, 
and a powerful set of internal regis- 
ters, the NCR/32 offers flexibility and 
high performance advantages not 
available with other microprocessors. 


Along with an existing set of VLSI 
family support devices, the NCR/32 
offers effective emulation of register, 
stack and descriptor-based system ar- 
chitectures, as well as execution of 
high-level languages directly from microcode.
The NCR/32 is well suited
for applications requiring direct addressing
of a large memory space,
high numeric precision, and very- 
high-speed execution such as bit- 
mapped graphics, robotics, artificial 
intelligence, and relational data 
bases. 


NCR/32 
FAMILY ARCHITECTURE 
NCR/32-010 ATC


The NCR/32 VLSI Processor family 
consists of the Central Processor Chip 
(CPC), the Address Translation Chip
(ATC), and the System Interface Con- 
troller (SIC). Additional members of 
the family include the Extended 
Arithmetic Chip (EAC), the System In- 
terface Transmitter (SIT) and Receiver 
(SIR) chips, and the Bus Assist Chip 
(BAC). 


The CPC performs the basic microprocessing
function using four 32-bit
internal data paths, complemented by
two independent external data paths:
the 32-bit Processor Memory (PM)
Bus and the 16-bit Instruction Storage
Unit (ISU) Bus. An integral part of the
CPC is the Arithmetic Logic Unit
(ALU) which is used for performing
decimal and binary arithmetic functions
and logical operations. There are
two sets of registers in the CPC. The
Register Storage Unit consists of 16
32-bit registers used for storage and
manipulation of data; the additional 22
registers of the Internal Register Unit
are used as jump address registers
and operand pointer registers. A
three-stage pipeline ensures that one
microinstruction is being fetched,
another read, and a third executed in
the same time frame.


The system clock is a two-phase, 
non-overlapping clock operating at 
13.3MHz. This yields a 150 nanose- 
cond clock cycle with 90% of the mi- 
croinstructions executing in one cycle. 
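
A quick calculation relates the two figures quoted above. A 13.3 MHz clock has a period of roughly 75 ns, so the 150 ns cycle corresponds to two clock periods; the Python sketch below takes that reading as an assumption. The average-time estimate additionally assumes, purely for illustration, that the remaining 10% of microinstructions take two cycles, a figure the brochure does not give.

```python
CLOCK_HZ = 13.3e6

period_ns = 1e9 / CLOCK_HZ    # one clock period in nanoseconds
cycle_ns = 2 * period_ns      # assumption: one machine cycle spans two clock periods


def average_instruction_ns(one_cycle_fraction=0.90, slow_cycles=2):
    """Average microinstruction time if the slower ones take `slow_cycles` cycles."""
    slow_fraction = 1.0 - one_cycle_fraction
    return cycle_ns * (one_cycle_fraction + slow_fraction * slow_cycles)


print(round(period_ns, 1), round(cycle_ns, 1), round(average_instruction_ns(), 1))
```

Under these assumptions the average microinstruction time stays within about 10% of the 150 ns cycle.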


GAPP Development System 


To support software development, there
is a GAPP Evaluation Module which
consists of a software development
package and hardware accelerator
board for IBM compatible personal
computers. Development software for
the Evaluation Module includes the
GAPP Algorithm Language compiler,
and a program debugger which allows
single and multiple step execution. In
addition, the programmer is able to examine
and change internal registers
and RAM locations in each processor
element.


Also available is a GAPP Simulator/
Assembler which allows the programmer
to simulate GAPP programs on
processor arrays of arbitrary size. The
Simulator/Assembler allows the user to
write and debug programs in GAPP
micro-code, and examine internal registers
and RAM locations.


16 x 16 Single Port Multiplier/
Accumulator Chip


FEATURES 


• 24-pin ceramic or plastic DIP
• 40-bit accumulator
• 190ns cycle time (typ)
• Fully static operation: no clock
  required
• Single port allows easy interfaces to
  microprocessor bus


The NCR45CM16 is a 24-pin CMOS
multiplier/accumulator chip for use with
16-bit microprocessor systems. All input
and output data are transferred through
a single 16-bit bidirectional data bus in
signed two's complement format. This
device is TTL/CMOS compatible and requires
no clock due to its totally static
(asynchronous) operation. The
45CM16 may be attached to a microprocessor
bus in a way similar to a 16-bit
wide static RAM.


The single port design of the 45CM16 
makes it much more compact than 
three port devices. Another compara- 
tive advantage of the 45CM16 relative 
to three port multiply/accumulate chips 
is that there is no need to use a lot of 
glue logic to interface it to the micro- 
processor bus. Static operation frees 
the system designer from having to 
generate clock signals to control the 
device. These three attributes: small 
package, ease of interface to micropro- 
cessor bus, and static operation mean 
that boards designed with the 45CM16 
will be more compact and easier to de- 
sign. 

An 8086 or 68000 using the 45CM16 
can realize a 3X enhancement in overall 
multiplication speed compared with 
performing the multiplication operation 
in software using the 68000 instruction 
set. The 40-bit accumulator allows 32- 
bit partial products to be accumulated 
up to 256 times before the contents of 
the accumulator must be read. 
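
The 256-accumulation figure follows from the accumulator width: a 40-bit
accumulator leaves 40 - 32 = 8 guard bits above a 32-bit product, and
2^8 = 256. The sketch below checks that arithmetic for the worst-case signed
16 x 16 product. It is only an illustration of the headroom calculation; the
names are hypothetical and it does not model the chip's bus interface.

```python
ACC_BITS = 40       # accumulator width stated for the 45CM16
PRODUCT_BITS = 32   # width of a 16 x 16 product


def fits_signed(value, bits):
    """True if value is representable in signed two's complement of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


guard_bits = ACC_BITS - PRODUCT_BITS      # bits of headroom above one product
max_accumulations = 1 << guard_bits       # number of products that can be summed

# Largest-magnitude signed 16 x 16 product: (-32768) * (-32768) = 2**30.
worst_product = (-32768) * (-32768)
worst_sum = worst_product * max_accumulations
```

Under these assumptions `worst_sum` still fits in a signed 40-bit register,
which is consistent with the stated limit of 256 accumulations between reads.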
NCR/32-796A 
Board 


Board Highlights 


• Dual port main memory access capability 
  using either Multibus or iLBX* 
• Full 32-bit VLSI Chip Set 
  - Central Processor Chip (CPC) 
  - Address Translation Chip (ATC) 
  - Extended Arithmetic Chip (EAC) 
• 150ns instruction/PM bus cycles 
• Real and virtual memory operations 
• On-board breakpoint and movable 
  window trace capability 
• 4K words of ROM containing 
  diagnostics and debug routines 
• 16K words of on-board RAM for 
  user-defined microcode 


The NCR/32-796A board, featuring the 
NCR/32 Chip Set, provides new oppor- 
tunities for microcode generation at the 
microprocessor level. The board pro- 
vides an alternate iLBX I/O port for high- 
speed memory transfers. A wide range 
of user applications include: 

• Dedicated algorithmic processing 
• File processing in intelligent networks 
• Graphics co-processing 
• Robotics control 
• Virtual machine emulation 
• High-level language acceleration 
• Image recognition 


The NCR/32-796A board includes an 
Instruction Storage Unit (ISU) providing 
16K words of storage for user microcode. 
Use of the Extended Arithmetic 
Chip (EAC) offers the following math 
capabilities: single and double precision 
fixed-point binary multiplication and 
division, single and double precision 
floating-point hexadecimal (IBM format), 
floating-point decimal, and format 
conversion. 


Resident microcode-development 
firmware makes breakpoint and trace 
logic readily accessible via onboard 
ROM. Additional development interface 
and assembler software is also availa- 
ble. 


*Multibus and iLBX are trademarks of Intel Corporation. 
The ATC provides memory management 
functions using either virtual or 
real memory addressing. To support 
virtual memory operations in the NCR/ 
32 chipset, an extra PM bus cycle 
precedes the standard memory ac- 
cess. Two 32-bit registers, the TOD 
Register/Counter and the Interval Ti- 
mer Monitor Register, are used for 
time interval monitoring. An NCR- 
patented ‘scrubbing’ technique 
checks, and corrects if necessary, a 
64K word block of memory every 
1.048 seconds. The ATC has three 
virtual address page sizes: 1K, 2K, 
and 4K bytes. 
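
To illustrate what the three page sizes imply for address translation, the
sketch below splits a byte address into a page number and an in-page offset.
It assumes byte addressing and power-of-two pages; the function name is
hypothetical, and the ATC's actual translation-table format is not described
here.

```python
PAGE_SIZES = (1024, 2048, 4096)  # 1K, 2K and 4K bytes, as listed for the ATC


def split_virtual_address(addr, page_size):
    """Return (page_number, offset) for a byte address and a supported page size."""
    if page_size not in PAGE_SIZES:
        raise ValueError("unsupported page size")
    offset_bits = page_size.bit_length() - 1    # 10, 11 or 12 offset bits
    return addr >> offset_bits, addr & (page_size - 1)
```

A larger page leaves fewer bits for the page number, so the same address maps
to a smaller page number while the total address range stays the same.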


The EAC is a performance booster 
used during arithmetic operations. 
Fixed point, decimal, and hexadecimal 
floating point formats are all 
handled by the EAC. (Hexadecimal 
floating point format is compatible with 
the IBM/370.) Results are in either 
single (one word) or double (two 
words) precision. Conversion opera- 
tions between formats are also 
handled. 
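
For readers unfamiliar with the IBM/370 hexadecimal format mentioned above,
the sketch below decodes a single-precision (one word) value in software. It
follows the published System/370 layout: one sign bit, a 7-bit excess-64
exponent that is a power of 16, and a 24-bit fraction. The function name is
hypothetical, and nothing here describes how the EAC itself is implemented.

```python
def decode_ibm_hex_single(word):
    """Decode a 32-bit IBM System/370 single-precision hexadecimal float."""
    sign = -1.0 if (word >> 31) & 1 else 1.0
    exponent = (word >> 24) & 0x7F                   # excess-64, base 16
    fraction = (word & 0x00FFFFFF) / float(1 << 24)  # 24-bit fraction, 0 <= f < 1
    return sign * fraction * 16.0 ** (exponent - 64)
```

For example, the word 0x42640000 has exponent 0x42 (16 squared) and fraction
0x640000 (100/256), which decodes to 100.0.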


The SIC performs communication 
management between the NCR/32 
chipset and the I/O devices. Used 
with the SIT and SIR (which perform 
data format conversions), the SIC 
sends and receives messages at up 
to 24 megabits per second per channel. 
The SIC/SIT/SIR communications 
subsystem operates in either Data 
Link Control mode or Local Area 
Network mode. In the Data Link Control 
mode, the SIC has access to eight 
transmission channels through a polling 
scheme. This mode is designed to 
control multiple peripheral devices on 
a system. The Local Area Network 
mode is designed for high-speed 
transmissions in a network environment, 
using two different channels of 
access. 


The NCR/32 Development System is 
available to help in evaluating the 
NCR/32 chipset and in developing 
microcode for particular system applications. 
A complete development system 
tem consists of two NCR components, 
the NCR/32-796A Board and the 
NCR/32 Debug Monitor along with the 
following: 


• An IBM-compatible PC 
• A relocatable, linkable assembler 
• A Multibus™ development environment, 
  including: 
  - a chassis 
  - an adapter kit consisting of a Multibus 
    board and a PC board 
  - a memory board 


In addition, experienced NCR applications 
engineers can assist in determin- 
ing the suitability of the NCR/32 family 
for solving applications problems. 
These engineers can provide extensive 
training on the NCR/32 systems archi- 
tecture, individual chips, and the use of 
design support tools. 


Multibus is a registered trademark of Intel Corp. 
NCR Microprocessors 


NCR 6518 8-bit Microprocessor utilizing the 6507 CPU. 

Contains 128x8 Static RAM, two bi-directional programmable I/O ports, programmable 
interval timer. 

NCR 65C02 8-bit Microprocessor, software compatible with the NMOS 6502. 2 or 3 MHz operation, 
64K-byte addressable memory, low power consumption 4mA @ 1 MHz. 

NCR 65C21 Peripheral Interface Adapter, with two 8-bit bidirectional I/O ports, and four peripheral 
control/interrupt input lines. 

NCR 65C22 Versatile Interface Adapter with internal timer/counters. Compatible with NMOS 6522. 
Two powerful 16-bit programmable internal timer/counters. Latched input/output registers 
on both I/O ports. 

NCR 65CX02 Identical to 65C02 except for the addition of four bit manipulation instructions (SMB, 
RMB, BBS, BBR). Will operate at 2, 3, or 4 MHz. 
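
The four added instructions test or modify a single bit of a memory byte. The
Python model below sketches their effect as commonly documented for 65C02-family
parts that carry these extensions (set memory bit, reset memory bit, branch on
bit set, branch on bit reset). It is an illustration only: the function
signatures are hypothetical, and opcodes, addressing details and cycle counts
are omitted.

```python
def smb(memory, addr, bit):
    """SMB: set one bit (0-7) of the byte at addr."""
    memory[addr] |= (1 << bit)


def rmb(memory, addr, bit):
    """RMB: reset (clear) one bit (0-7) of the byte at addr."""
    memory[addr] &= ~(1 << bit) & 0xFF


def bbs(memory, addr, bit, next_pc, offset):
    """BBS: return the branch target if the bit is set, else fall through."""
    taken = bool(memory[addr] & (1 << bit))
    return (next_pc + offset) & 0xFFFF if taken else next_pc


def bbr(memory, addr, bit, next_pc, offset):
    """BBR: return the branch target if the bit is clear, else fall through."""
    taken = not (memory[addr] & (1 << bit))
    return (next_pc + offset) & 0xFFFF if taken else next_pc
```

Here `next_pc` is the address of the following instruction and `offset` is a
signed relative displacement, mirroring how 6502 relative branches work.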

Microcomputers 


NCR 6500/1     16-bit Programmable       1, 2, 3 MHz 
NCR 6500/11    (2) 16-bit Programmable   1 or 2 MHz     Full Duplex UART, 10 Interrupts 
NCR 65C00/1    16-bit Programmable       1, 2, 4 MHz    Low Power: 4mA/MHz Max, 1mA/MHz Typical 
NCR 65C00/2    16-bit Programmable       1, 2, 4 MHz    Low Power: 4mA/MHz Max, 1mA/MHz Typical 
NCR 65C00/3    16-bit Programmable       1, 2, 4 MHz    Low Power: 4mA/MHz Max, 1mA/MHz Typical 

All parts have I/O capabilities of 32 bi-directional lines, are powered by a 5V power supply, 
and are packaged in a 40 pin DIP. 


Special Function Chips 
SCSI 

NCR 5380 SCSI Protocol Controller 
Supports latest ANSI X3T9.2 SCSI draft-proposed standard. Asynchronous data transfers 
to 1.5 Megabytes/sec. Operates in both initiator and target roles. Supports arbitration 
including reselection. Contains on-chip open collector (48 mA at .5V) bus 
transceivers. Requires +5V supply in a 40 pin DIP. 

NCR 5385E SCSI Protocol Controller 
Enhanced 5385 supports the latest ANSI X3T9.2 SCSI Standard. Asynchronous data 
transfers to 1.5 Megabytes/sec. Operates in both initiator and target roles. Supports arbitration 
including reselection. Uses external open collector or differential pair transceivers. 
Double buffered data registers, 24-bit transfer counter and automatic protocol handling 
provide a high performance interface. Requires +5V supply in a 48 pin DIP. 

NCR 5386 SCSI Protocol Controller 
Replacement for NCR 5385. Updates all SCSI timings to latest ANSI specification with 
operational enhancements. Production June '85. 
Graphics 

NCR 7250 CRT Controller 
On-chip character ROM with 192 characters. Addresses a 2Kx8 video RAM. Generates 
VSYNC, HSYNC and VIDEO to interface directly with CRT monitor. Eight screen and 
six field functions are under software control. Dot clocks up to 20MHz with +5V supply 
in a 40 pin DIP. 

NCR 7300 Color Graphics Controller 
Translates high level commands from host computer into video operations such as 
drawing and text manipulation, and provides video output to monitor. Supports a displayable 
screen resolution of 640x480 pixels at 60Hz, and a frame buffer of 
1024x1024. Has analog RGB outputs, and pixel rates to 30 MHz. Interfaces to 8-bit or 
16-bit processor. Housed in 68 pin package and uses +5V supply. 

NCR 7301 Memory Interface Controller 
Companion chip to NCR 7300. Multiplexes and demultiplexes between four and sixteen 
bit busses. Designed for implementation of high performance graphics systems 
and similar applications requiring rapid data handling. Requires +5V supply in a 28 
pin DIP. 
Other 

NCR 8301 Bar Code Processor 
Decodes code 39 and interleaved 2 of 5, bidirectional decoding, velocity of 1 to 50 in/ 
sec with 32-character tag buffer. Standalone or peripheral mode with +5V supply in a 
40 pin DIP. 

NCR 8489 Sound Generator 
Functionally and pin compatible with SN76489A. Three programmable tone generators. 
Programmable white noise generator with 4 MHz (max) clock input. Requires +5V supply 
in a 16 pin DIP. 
NCR Microelectronics 
Area Sales Offices 

Eastern Area 
NCR Microelectronics Division 
400 W. Cummings Park 
Suite 2750 
Woburn, MA 01801 
Phone: (617) 933-0778 

Central Area 
NCR Microelectronics Division 
400 Chisholm Place 
Suite 100 
Plano, TX 75075 
Phone: (214) 578-9113 

Western Area 
NCR Microelectronics Division 
4655 Old Ironsides Drive 
Suite 400 
Santa Clara, CA 95050 
Phone: (408) 727-6575 


NCR Microelectronics 
Manufacturing Plants 

Non-Volatile Memories 
Read-Only Memories 
NCR Microelectronics Division 
8181 Byers Road 
Miamisburg, OH 45342 
Phone: (800) 543-5618 
(513) 866-7471 in Ohio or 
International 
Telex: 241669 NCR NVMEM MSBG 

Semicustom Design 
Digital Signal Processing 
NCR Microelectronics Division 
2001 Danfield Court 
Fort Collins, CO 80525-2998 
Phone: (303) 226-9500 or 223-5100 
Telex: 45-4505 NCRMICRO FTCN 

Microprocessors/Peripherals 
NCR Microelectronics Division 
1635 Aeroplaza Drive 
Colorado Springs, CO 80916 
Phone: (800) 525-2252 
(303) 596-5611 in Colorado or 
International 
Telex: 452-457 NCR MICRO CSP 
NCR Microelectronics 
Design Centers 

Semicustom Design Center Locations 

Aptek Microsystems 
700 N.W. 12th Avenue 
Deerfield Beach, FL 33441 
(305) 421-8450 
Contact: Trygve (Tryg) Ivesdal 

Integrated Circuit Systems, Inc. 
1012 W. Ninth Avenue 
King of Prussia, PA 19406 
(215) 265-8690 
Contact: Ed Arnold or Jere Hohmann 

Custom Silicon, Inc. 
600 Suffolk Street 
Lowell, MA 01854 
(617) 454-4600 
Contact: David Guinther 

Ontario Centre for Microelectronics 
1150 Morrison Drive 
Suite 400 
Ottawa, Canada K2H9B8 
(613) 596-6690 
Contact: Dr. Karl Siemens 

Design Engineering, Inc. 
1900 13th St., Suite 304 
Boulder, CO 80302 
303/440-7997 
Contact: Steve Davis 

Manhattan Skyline, Ltd. 
United Kingdom 
Manhattan House 
Bridge Road 
Maidenhead 
Berkshire SL6 8DB 
England 
Maidenhead (0628) 75851 
Contact: Stu Kitchiner 

Array Technology 
1297 Parkmoor Avenue 
San Jose, CA 95126 
408/297-3333 
Contact: Dan Weed 